SPECIAL INVITED PAPER
ON THE CONSISTENCY OF BAYES ESTIMATES 


By PERSI DIACONIS AND DAVID FREEDMAN
Stanford University and University of California, Berkeley 


We discuss frequency properties of Bayes rules, paying special attention 
to consistency. Some new and fairly natural counterexamples are given, 
involving nonparametric estimates of location. Even the Dirichlet prior can 
lead to inconsistent estimates if used too aggressively. Finally, we discuss 
reasons for Bayesians to be interested in frequency properties of Bayes rules. 
As a part of the discussion we give a subjective equivalent to consistency and 
compute the derivative of the map taking priors to posteriors. 


1. Consistency of Bayes rules. One of the basic problems in statistics can 
be put this way. Data is collected following a probability model with unknown 
parameters; the parameters are to be estimated from the data. Often, there is 
prior information about the parameters, for example, their probable sign or order 
of magnitude. Many statisticians express such information in the form of a prior 
probability over the unknown parameters. Estimates based on prior probabilities 
will be called Bayes estimates in what follows. 

This paper studies points of contact between frequentist and Bayesian statis- 
tics. We derive frequency properties of Bayes estimates and suggest a Bayesian 
interpretation for some frequentist computations. Our main concern is con- 
sistency: as more and more data are collected, will the Bayes estimates converge 
to the true value for the parameters? 

If the underlying probability mechanism has only a finite number of possible 
outcomes (tossing a coin or die) and the prior probability does not exclude the 
true parameter values as impossible, it has long been known that Bayes estimates 
are consistent. As will be discussed below, if the underlying mechanism allows an 
infinite number of possible outcomes (e.g., estimation of an unknown probability 
on the integers), Bayes estimates can be inconsistent: as more and more data 
comes in, some Bayesian statisticians will become more and more convinced of 
the wrong answer. The class of tail-free and Dirichlet priors was introduced to 
insure consistency in such settings. We present examples showing that mechani- 
cal extension of such priors to other very similar settings leads to inconsistent 
estimates. 

In Section 2 we review other points of contact between the mathematics of 
frequentist and Bayesian statistics. In Section 3 we offer two Bayesian uses for 
frequentist computations. The first is a subjective equivalent to consistency 
involving intersubjective agreement. The second uses frequency computations as 
a way of thinking about priors. As part of the discussion, we compute the 
derivative of the map taking priors to posteriors. Mathematical details are given 
in the appendices. 

To define things, consider a family of probabilities $\{Q_\theta: \theta \in \Theta\}$ on a space $\mathcal{X}$.
Write $Q_\theta^\infty$ for the infinite product measure on $\mathcal{X}^\infty$ which makes the coordinate
random variables $X_1, X_2, \ldots$ independent with common distribution $Q_\theta$. We
will assume throughout that $\mathcal{X}$ and $\Theta$ are Borel subsets of complete separable
metric spaces. Let $\mu$ be a prior probability on $\Theta$. Let $P_\mu$ be the joint distribution
of the parameter and the data:

$$P_\mu(A \times B) = \int_A Q_\theta^\infty(B)\,\mu(d\theta)$$

for Borel sets A and B. The posterior is the $P_\mu$-distribution of the parameter $\theta$
given the data $X_1, \ldots, X_n$; we denote this by $\mu_n(d\theta \mid X_1, \ldots, X_n)$. The usual
Bayes estimate is just the mean of the posterior.

Here are a few typical examples: 

Coin-tossing. $\mathcal{X}$ has two points, H and T for heads and tails, respectively.
The parameter space $\Theta$ can be taken as the unit interval: $\theta \in \Theta$ is the probability
of a head. Then $\mathcal{X}^\infty$ is the space of sequences of heads and tails; $Q_\theta^\infty$ makes
the sequence independent; in any position there is chance $\theta$ of getting a head and
$1 - \theta$ of getting a tail. We call this a "$\theta$-coin." Informally, $P_\mu$ may be described as
follows: choose $\theta$ at random from $\mu$, then toss a $\theta$-coin. Of course, the posterior
$\mu_n(d\theta \mid X_1, \ldots, X_n)$ has a density with respect to the prior, proportional to the
likelihood function:

$$\theta^S (1 - \theta)^{n-S} \Big/ \int \theta^S (1 - \theta)^{n-S}\,\mu(d\theta),$$

where S is the number of heads and n - S the number of tails. Thus, the Bayes
estimate is

$$\int \theta^{S+1} (1 - \theta)^{n-S}\,\mu(d\theta) \Big/ \int \theta^S (1 - \theta)^{n-S}\,\mu(d\theta).$$
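The ratio above can be evaluated numerically whenever the prior sits on finitely many points. The following is a minimal sketch under that assumption; the function name `bayes_estimate` and the midpoint grid standing in for a uniform prior are illustrative choices, not part of the paper.

```python
import math

def bayes_estimate(points, weights, heads, n):
    """Posterior mean of theta for a prior on finitely many points in (0, 1)."""
    # Work on the log scale so that large n does not underflow.
    logs = [math.log(w) + heads * math.log(t) + (n - heads) * math.log(1.0 - t)
            for t, w in zip(points, weights)]
    top = max(logs)
    post = [math.exp(v - top) for v in logs]  # unnormalized posterior weights
    return sum(t * p for t, p in zip(points, post)) / sum(post)

# Hypothetical prior: equal weights on a fine midpoint grid,
# used here as a stand-in for the uniform prior on [0, 1].
N = 2000
grid = [(i + 0.5) / N for i in range(N)]
estimate = bayes_estimate(grid, [1.0 / N] * N, heads=7, n=10)
print(estimate)
```

For the exactly uniform prior the posterior mean is (S + 1)/(n + 2), so the grid value should be close to 8/12 in this case.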


Rolling a die. This is the same as coin tossing, except that $\mathcal{X} = \{1, 2, 3, 4, 5, 6\}$,
and $\Theta$ is the "6-simplex," all sequences $\theta_1, \ldots, \theta_6$ of length six whose terms are
nonnegative and add to one:

$$\theta_i \ge 0 \quad\text{and}\quad \sum_{i=1}^{6} \theta_i = 1.$$

Rolling an infinite die. The same, except $\mathcal{X} = \{1, 2, 3, \ldots\}$ and $\Theta$ is an
infinite-dimensional simplex. Despite the superficial similarity, all the paradoxes
of inconsistency already appear in this case. From our point of view, this is the
simplest natural example of a nonparametric problem: estimating an unknown
probability on the positive integers. On the other hand, we are still in the
dominated case, so the posterior can still be computed in the usual way: $\mu_n$
denotes this familiar version of the posterior.
The real line. Now $\mathcal{X} = (-\infty, \infty)$, and $\Theta$ is the set of all probabilities on $\mathcal{X}$.
This is another nonparametric problem: estimating an unknown probability on
the line. There is an additional complication: $\Theta$ is undominated. The posterior
still exists but in general there is no explicit formula for it. There will usually be
many versions of the posterior and this causes some technical difficulties.

We will use the weak-star topology: if $\mu_n$ and $\mu$ are probabilities on $\Theta$, then
$\mu_n \to \mu$ iff $\int f\,d\mu_n \to \int f\,d\mu$ for all bounded continuous functions f on $\Theta$. We
denote point mass at $\theta$ by $\delta_\theta$. Thus, $\mu_n \to \delta_\theta$ iff $\mu_n(U) \to 1$ for every
neighborhood U of $\theta$.

In the coin tossing case, $\Theta$ is the unit interval; $\mu_n \to \mu$ weak-star if and only if
$\mu_n[0, x] \to \mu[0, x]$ for all intervals $[0, x]$ such that $\mu\{x\} = 0$. Turn next to the infinite
die. Then $\mu_n \to \mu$ if and only if

$$\mu_n\{\theta \mid \theta_1 < x_1 \text{ and } \theta_2 < x_2 \ldots \text{ and } \theta_k < x_k\}
\to \mu\{\theta \mid \theta_1 < x_1 \text{ and } \theta_2 < x_2 \ldots \text{ and } \theta_k < x_k\}$$

for all k and all x's such that

$$\mu\{\theta \mid \theta_1 = x_1 \text{ or } \theta_2 = x_2 \ldots \text{ or } \theta_k = x_k\} = 0.$$

Finally, take the line. Changing the imagery a little, $\mu_n \to \mu$ if and only if

$$\int \prod_{i=1}^{k} \Big[\int f_i(x)\,\theta(dx)\Big]\,\mu_n(d\theta) \to \int \prod_{i=1}^{k} \Big[\int f_i(x)\,\theta(dx)\Big]\,\mu(d\theta)$$

for all k and all bounded continuous functions $f_i$ on $\mathcal{X}$.


Consistency. The pair $(\theta, \mu_n)$ is consistent if for $Q_\theta^\infty$-almost all sequences, the
posterior $\mu_n$ converges to point mass at $\theta$ in the weak-star topology. The
weak-star topology has the fewest open sets of any natural topology, so it is fairly
easy for posteriors to be consistent. Minor technical difficulties apart, if $(\theta, \mu_n)$ is
consistent in our sense, the Bayes estimate for $\theta$, in the sense of a posterior
mean, will be consistent too. We say $\mu_n$ is consistent if $(\theta, \mu_n)$ is consistent for
all $\theta$. Notice that consistency depends to some extent on the version selected for
$\mu_n$. Often, there is only one natural version, as in the dominated case. Sometimes,
a natural version can be selected on the basis of continuity in the data: see Zabell
(1979) and Tjur (1980). Sometimes, however, there is no sensible way to resolve
the ambiguity. When there is a natural version of $\mu_n$, as in the dominated case,
we will say that $(\theta, \mu)$ is consistent rather than $(\theta, \mu_n)$.

In the coin-tossing example, what does it mean for $(\theta, \mu_n)$ to be consistent?
Just that a Bayesian with prior $\mu$, who happens to be tossing a $\theta$-coin, will
eventually find this out: his posterior will concentrate in smaller and smaller
intervals around $\theta$ as more and more data comes in. Likewise for the infinite die.
More specifically, $(\theta, \mu)$ is consistent if and only if for any positive integer k and
any small positive $\epsilon$,

$$\mu_n\{N_{k,\epsilon} \mid X_1, \ldots, X_n\} \to 1 \quad\text{as } n \to \infty \text{ a.e. } Q_\theta^\infty,$$

where

$$N_{k,\epsilon} = \{\phi: \phi \in \Theta \text{ and } |\phi_i - \theta_i| < \epsilon \text{ for } i = 1, \ldots, k\}.$$

For coin-tossing, $(\theta, \mu)$ is consistent iff $\mu$ assigns positive mass to every open
interval around $\theta$. For the infinite die the situation is much more
complicated, and that is what the present discussion is all about.

Doob (1948) proved a very general theorem on consistency: one implication is
that $(\theta, \mu_n)$ is consistent for $\mu$-almost all $\theta$. Thus, a Bayesian with prior $\mu$ can be
sure that the posterior will converge. (This does not depend on the version,
because null sets do not matter here.) Doob's work has been extended by
Breiman, Le Cam, and Schwartz (1964). However, a frequentist using a Bayes
rule will want to know for which $\theta$'s the rule is consistent.

For smooth, finite-dimensional families, $(\theta, \mu)$ is consistent if and only if $\theta$ is
in the support of $\mu$. See Freedman (1963) or Schwartz (1965). (The support is the
smallest closed set of probability 1.) But the assumption of finite dimensionality
is really needed. For example, take $\mathcal{X}$ to be the positive integers. Take $\Theta$ to be
the set of all probabilities on $\mathcal{X}$. Take $\theta$ to be the geometric distribution with
parameter 1/4. Freedman (1963) constructed a prior $\mu$ with the following properties:


• Every open neighborhood of $\theta$ has positive $\mu$-probability.
• For $\theta^\infty$-almost all sequences, the posterior converges to point mass at a
geometric distribution with parameter 3/4.


This example is generic in a topological sense: for most priors $\mu$ and most
parameters $\theta$, the pair $(\theta, \mu)$ is inconsistent. To make this precise, we need the
notion of "category." A set is of the first category if it is contained in a countable
union of closed, nowhere dense sets. First-category sets are the topological
equivalents of null sets. Put the weak-star topology on $\pi(\Theta)$, the set of priors on
$\Theta$. Then the set of consistent pairs $(\theta, \mu)$ is of the first category in $\Theta \times \pi(\Theta)$. See
Freedman (1965).


Tail-free and Dirichlet priors. The existence of such counterexamples sug- 
gests the need for some careful investigation. Are there priors consistent for all 
parameters? (We will call such priors "consistent.") For countable $\mathcal{X}$, Freedman
suggested using tail-free and Dirichlet priors, showing that these priors are 
consistent at all parameters. This work was extended by Fabius (1964), who 
showed how to construct consistent tail-free priors for any complete, separable 
metric $\mathcal{X}$. These ideas were further developed by Ferguson (1973, 1974) with an
excellent survey of the literature. Other extensions include “neutral-to-the-right 
priors.” See Doksum (1974). 

Here is a brief description of tail-free priors. For definiteness, consider the
problem of estimating an unknown probability $\theta$ on the positive integers. Write $\theta_i$
for the ith coordinate of $\theta$. We visualize the prior as randomly selecting a
probability on the positive integers. Even more crudely, what the prior does is to
randomly distribute a total mass of 1 among the integers. Thus, $\theta_1$ is the
randomly selected mass assigned to the integer 1, while $\theta_2$ is the mass assigned to
2, and so on.

The simplest tail-free prior on the integers can be described by "stick-breaking."
Let $S_1, S_2, \ldots$ be independent and uniform over [0, 1]. Think of a stick of
unit length. Break off a piece of length $\theta_1 = S_1$. This leaves a remaining piece of
length $1 - \theta_1$. Now break off $\theta_2 = S_2(1 - \theta_1)$, $\theta_3 = S_3(1 - \theta_1 - \theta_2)$, and so forth.
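The stick-breaking recipe with uniform cuts can be sketched in a few lines. This is only an illustration: the function name `stick_breaking`, the truncation at 20 pieces, and the seed are assumptions made for the example.

```python
import random

def stick_breaking(num_pieces, rng):
    """Return theta_1..theta_num_pieces from uniform stick-breaking, plus the leftover mass."""
    remaining = 1.0   # length of stick not yet broken off
    thetas = []
    for _ in range(num_pieces):
        cut = rng.random()              # S_i, uniform on [0, 1)
        thetas.append(cut * remaining)  # theta_i = S_i * (1 - theta_1 - ... - theta_{i-1})
        remaining *= 1.0 - cut
    return thetas, remaining

rng = random.Random(0)
thetas, leftover = stick_breaking(20, rng)
print(thetas[:3], leftover)
```

The pieces and the leftover always add to one, so truncating after finitely many breaks gives a probability on finitely many integers plus a tail mass.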

Freedman (1963) suggested a useful extension: the distribution of any finite
number of the $\theta_i$, say $\theta_1, \ldots, \theta_N$, can be specified in an essentially arbitrary way.
Then the prior is completed by stick-breaking; the "cuts" $S_i$ are independent but
not necessarily uniform or even identically distributed. The $S_i$ take values in
(0, 1) and have arbitrary distributions with two restrictions: $S_i$ falls into any open
interval with positive probability and $\sum_i E(S_i) = \infty$. The cuts are used to distribute
mass inductively as follows: suppose mass $m = \sum_{i=1}^{N} \theta_i < 1$ has been assigned
and mass $1 - m$ remains; now mass $\theta_{N+1} = (1 - m) S_{N+1}$ is assigned to N + 1
and mass

$$1 - m - (1 - m) S_{N+1} = (1 - m)(1 - S_{N+1})$$

is left for the next move, which is carried out using $S_{N+2}$, and so on.

The motivation is as follows. A Bayesian may have a reasonably clear opinion
about $\theta_i$ for some i's. However, it seems unlikely that such an opinion can be
carefully quantified for all i. Freedman's extension of stick-breaking allows a
Bayesian to approximate any prior by one consistent at all parameter values;
indeed, any prior can be approximated (weak-star) by specifying the distribution
of a finite number of $\theta_i$.

Early users of "stick-breaking" were Banach (1964), Kahane and Salem (1958),
and Eberlein (1962). Kahane and Salem studied sums $S = \sum_{i=1}^{\infty} r_i X_i$, where the $X_i$ are
iid, taking the values 0 or 1 with probability 1/2 each; the $r_i$ are nonnegative and
sum to 1. When is the distribution of S absolutely continuous? Kahane and
Salem prove that for "almost all" sequences $(r_1, r_2, \ldots)$ the law of S is absolutely
continuous with an $L^2$ density; "almost all" is relative to a stick-breaking
measure. Banach and Eberlein were interested in a natural integral over the 
function spaces l, and 1, and used stick-breaking as a key ingredient. 

Dirichlet priors are tail-free, the $S_i$ having certain beta distributions. Usually,
Dirichlet priors are parametrized in terms of a finite measure $\alpha$ on the observation
space $\mathcal{X}$, which is for the moment general. The Dirichlet prior $D(\alpha)$ with
base measure $\alpha$ can be characterized as follows. Partition the observation space $\mathcal{X}$
into a finite number of sets $A_1, A_2, \ldots, A_k$. Consider a probability $\theta$ on the
observation space, selected at random from $D(\alpha)$. Then $\theta(A_1), \ldots, \theta(A_k)$ are
random variables with respect to $D(\alpha)$. These have a Dirichlet distribution on the
k-simplex, with parameter vector $(\alpha(A_1), \ldots, \alpha(A_k))$. More concretely, these are
distributed like

$$U_1/S, \ldots, U_k/S,$$

where $S = U_1 + \cdots + U_k$, the $U_i$ being independent gamma variables with shape
parameter $\alpha(A_i)$. When the observation space is the integers, $\alpha$ is specified by
a countable collection of numbers $\alpha_1, \alpha_2, \ldots$. Let $\|\alpha\| = \sum_i \alpha_i$. By assumption,
$\|\alpha\| < \infty$. For a Dirichlet prior with parameter measure $\alpha$ on the positive
integers, the cut $S_i$ has a beta distribution with the two parameters $\alpha_i$
and $\|\alpha\| - (\alpha_1 + \cdots + \alpha_i)$.
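The link between beta cuts and the Dirichlet can be checked by simulation on a small finite space. The sketch below assumes the standard stick-breaking representation, in which the i-th cut is Beta($\alpha_i$, $\|\alpha\| - (\alpha_1 + \cdots + \alpha_i)$) and the last cell takes the remaining mass; the three-point base measure and the function name are hypothetical choices for illustration.

```python
import random

def dirichlet_by_cuts(alpha, rng):
    """Draw (theta_1, ..., theta_k) by stick-breaking with beta-distributed cuts."""
    total = sum(alpha)
    used = 0.0        # alpha_1 + ... + alpha_i so far
    remaining = 1.0   # mass not yet assigned
    thetas = []
    for i, a in enumerate(alpha):
        used += a
        if i == len(alpha) - 1:
            cut = 1.0  # the last cell takes whatever mass is left
        else:
            cut = rng.betavariate(a, total - used)
        thetas.append(cut * remaining)
        remaining *= 1.0 - cut
    return thetas

rng = random.Random(1)
alpha = [1.0, 2.0, 3.0]  # hypothetical base masses on a three-point space
draws = [dirichlet_by_cuts(alpha, rng) for _ in range(20000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(len(alpha))]
print(means)
```

Under a Dirichlet distribution the mean of the i-th coordinate is $\alpha_i / \|\alpha\|$, so the simulated means should be near 1/6, 1/3, and 1/2.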

Dirichlet priors have been suggested for use in a wide variety of problems. 
Also, some writers have considered using mixtures of Dirichlet priors; see Antoniak 
(1974) or Dalal and Hall (1980, 1983). It is natural to ask whether mixtures of 
Dirichlet priors are consistent. The first step is easy; any mixture of a finite 
number of consistent priors is consistent. However, in Freedman and Diaconis 
(1983), we showed that a countable mixture of Dirichlet priors can be incon- 
sistent. A starting point of our construction is an example showing that a mixture 
of a Dirichlet prior and a point mass at a certain long-tailed probability is 
inconsistent. Such priors are similar to the ones suggested by Jeffreys (1967) for 
Bayesian hypothesis testing. On the positive side, we showed that if the masses of
the parameter measures are bounded, then the mixture is consistent.

The inconsistent priors in Freedman (1963) were constructed with malice 
aforethought. Bayesian reaction seems to be, “Oh, nobody would ever use a prior 
like that.” See, for example, the remarks in Box, Leonard, and Wu (1983, page xi). 
That a Jeffreys-style prior is inconsistent should therefore be of interest. More- 
over, in Diaconis and Freedman (1986), we give examples of priors suggested by 
practicing Bayesians, which turn out to be inconsistent. That paper is fairly 
technical and the following heuristic discussion may be helpful. 




The location problem. Consider estimating an unknown location parameter $\theta$
with squared error as loss. The observations are modelled as

$$X_i = \theta + \epsilon_i.$$

The $\epsilon_i$ are independent disturbance terms with a common distribution G. If G
has known density g and a prior $\mu$ is put on $\theta$, then the Bayes estimate is the
mean of the posterior distribution:

$$\hat\theta_n = \frac{\int \theta \prod_{i=1}^{n} g(X_i - \theta)\,\mu(d\theta)}{\int \prod_{i=1}^{n} g(X_i - \theta)\,\mu(d\theta)}. \qquad (1.1)$$


If $\mu(d\theta)$ is taken as Lebesgue measure, then this becomes the Pitman estimator.
If the density g in (1.1) is unknown, it can be estimated from the data. Often, g 
is assumed to belong to some parametric family. Fraser (1976), Box and Tiao 
(1973, Chapters 3 and 4), and Johns (1979) all propose estimators of that general 
type, with a prior distribution on the parameters of the family. Such estimators 
can be inconsistent; the argument is like that in Freedman and Diaconis (1982b) 
or Diaconis and Freedman (1986) as outlined below. 
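For a known density g and Lebesgue prior, the posterior-mean ratio can be approximated by a Riemann sum. This is a rough sketch only: the integration range, the grid size, the data values, and the names `pitman_estimate`, `log_normal`, and `log_cauchy` are all assumptions made for the example.

```python
import math

def pitman_estimate(xs, log_g, lo, hi, steps=4000):
    """Approximate the flat-prior posterior mean on [lo, hi] by a midpoint Riemann sum."""
    width = (hi - lo) / steps
    grid = [lo + (i + 0.5) * width for i in range(steps)]
    # Log of prod_i g(X_i - theta), evaluated at each grid point.
    loglik = [sum(log_g(x - t) for x in xs) for t in grid]
    top = max(loglik)
    weights = [math.exp(v - top) for v in loglik]  # rescaled to avoid underflow
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

def log_normal(e):
    return -0.5 * e * e  # additive constants cancel in the ratio

def log_cauchy(e):
    return -math.log(1.0 + e * e)

data = [0.3, -1.2, 2.5, 0.9, 1.4]  # made-up observations
est_normal = pitman_estimate(data, log_normal, -20.0, 20.0)
est_cauchy = pitman_estimate(data, log_cauchy, -20.0, 20.0)
print(est_normal, est_cauchy)
```

With a standard normal g the flat-prior posterior is normal around the sample mean, so `est_normal` should agree with the average of the data; the Cauchy case has no such closed form.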

A nonparametric approach to estimating G is also natural. This involves
putting a prior on $\theta$ and a prior on the law G of $\epsilon_i$. Dalal (1979a, b) has suggested
using a Dirichlet prior for G. We will now show that for a Dirichlet prior with a
Cauchy base measure $\alpha$, the Bayes estimates are inconsistent.

To avoid identifiability problems, we will assume that the law G of $\epsilon$ is
symmetric. To put a prior on symmetric G's, we symmetrize the Dirichlet as
follows: if G is a distribution function for a random variable X, let $G^-$ be the
distribution function of $-X$ and let $\bar G = \frac{1}{2}(G + G^-)$, so $\bar G$ is symmetric. Let $\bar D_\alpha$
be the law of $\bar G$, where G has a Dirichlet distribution with base measure $\alpha$ on R.
This is a symmetrized Dirichlet. The construction was first suggested by Dalal
(1979a, b). For further discussion see Hannum and Hollander (1983). We assume
that the base measure $\alpha$ has density $\alpha'$ and write $\|\alpha\| = \alpha(R)$. We make $\theta$ and $\bar G$
independent; our prior for $\theta$ has density f on R while $\bar G$ is chosen from $\bar D_\alpha$.
The posterior distribution of $\theta$ is computed in Lemma 3.1 of Diaconis and
Freedman (1986). For simplicity, we only discuss the Bayes estimate here.


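THEOREM 1. Suppose that $X_1, \ldots, X_n$ are all distinct. Let $\theta_{ij} = \frac{1}{2}(X_i + X_j)$
and suppose the $\theta_{ij}$ are distinct. The Bayes estimate $\hat\theta_n$ is

$$\hat\theta_n = \frac{\int \theta\,w(\theta)\,d\theta + \sum_{i<j} \theta_{ij} w_{ij}}{\int w(\theta)\,d\theta + \sum_{i<j} w_{ij}}$$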


with

$$w(\theta) = \|\alpha\|\, f(\theta) \prod_{i=1}^{n} \alpha'(X_i - \theta),$$


oy = 3[1(0,)/0¢(8,)] Te Xe 8,,), 
$$\delta_{ij} = \tfrac{1}{2}(X_i - X_j).$$


For the next result, take f to be standard normal and $\alpha$ to be Cauchy. We
suppose that in fact the $\epsilon_i$ have a density h which is symmetric about 0, with a
strict maximum at 0; further, h is infinitely differentiable and vanishes off a
compact interval.


THEOREM 2. For some h, the Bayes estimate is inconsistent. Indeed, for
large n, there is probability near 1/2 that $\hat\theta_n$ is close to $\gamma$ and probability near 1/2
that $\hat\theta_n$ is close to $-\gamma$, where $\gamma \ne 0$ depends on h.


REMARKS. (1) Theorem 2 is valid when $\alpha$ is any t-distribution.

(2) When sampling from a continuous density, $X_i$ and $\theta_{ij}$ will be distinct with
probability 1.

(3) The Bayes estimate $\hat\theta_n$ is a convex combination of two other estimates:

$$\int \theta\,w(\theta)\,d\theta \Big/ \int w(\theta)\,d\theta$$

and

$$\sum_{i<j} \theta_{ij} w_{ij} \Big/ \sum_{i<j} w_{ij}.$$

The first is a Bayes estimate, as at (1.1), when sampling from known density $\alpha'$.
The second is a weighted average of the $\theta_{ij}$, which estimate $\theta$ from the pair $X_i$
and $X_j$. If $\alpha'$ is one of the standard unimodal densities such as normal or Cauchy,
the weight $w_{ij}$ is relatively large when $\theta_{ij}$ is in the center of the $X_k$'s. It turns out
that the first estimate dominates the second as $n \to \infty$ for some h; for others the
second will dominate. Here, we focus on the first case.

FIG. 1. The counterexample density h.

(4) The h constructed in Theorem 2 is trimodal, as in Figure 1. It has a unique
maximum at 0 but the two other modes matter. If desired, h can be chosen
strictly positive on the interior of its interval of support.

(5) Any of the classical estimators, such as the mean or the median, will be 
consistent in this situation, so the Bayes estimates do worse than available 
frequentist procedures. 

(6) In Diaconis and Freedman (1986), we argue that the Bayes estimate is
consistent for any strongly unimodal density h: strong unimodality means that h
increases up to its unique maximum and then decreases. Further, we show that if
$\log \alpha'$ is convex, then the Bayes estimate is consistent for any symmetric h.

(7) Doss (1983a, b, 1984) has carried out similar computations for neutral to 
the right and tail-free priors. He has also introduced some other methods of 
symmetrizing. Very roughly, his results parallel ours; it does not seem possible to 
find a consistent prior concentrated in a neighborhood of the normal location 
family. 

(8) A subjective rationale for mixtures of location models is discussed in 
Freedman and Diaconis (1982a). 

We now sketch the argument for inconsistency. As explained in Remark 3, the
Bayes estimate is essentially

$$\int \theta\,\varphi(\theta)\,d\theta \Big/ \int \varphi(\theta)\,d\theta$$

with $\varphi(\theta) = \prod_{i=1}^{n} 1/[1 + (X_i - \theta)^2]$, the $X_i$ being the data. We have

$$\varphi(\theta) = \exp\Big\{-\sum_{i=1}^{n} \log[1 + (X_i - \theta)^2]\Big\} \approx \exp\{-n E[\log(1 + (X_1 - \theta)^2)]\}.$$

In the approximation, the sum has been replaced by its mean under the true
sampling distribution. Now consider

$$H(\theta) = E\{\log[1 + (X - \theta)^2]\}.$$

If X takes only the two values $\pm a$ where $a > 1$, then H has a local maximum at
0, a global minimum at $\pm\gamma$, and tends to infinity at $\pm\infty$. Of course, H is
symmetric. See Figure 2.

FIG. 2. The function H.

Now we estimate the integrals in the numerator and denominator of the Bayes
estimate by Laplace's method. As a function, $\varphi(\theta)$ is close to $\exp[-nH(\theta)]$, so only
$\theta$'s near $\pm\gamma$ matter, and as n tends to infinity, the estimate oscillates between
$+\gamma$ and $-\gamma$. The two-point distribution of X can be smoothed out to the density h
shown in Figure 1.
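The shape of H in the two-point case is easy to check numerically. In the sketch below, the value a = 2 and the search grid are arbitrary illustrative choices, and the closed form $\gamma = \sqrt{a^2 - 1}$ is not stated in the text: it comes from setting the derivative of $(1 + (a - \theta)^2)(1 + (a + \theta)^2)$ to zero.

```python
import math

def H(theta, a):
    """H(theta) = E log[1 + (X - theta)^2] when X = +a or -a with probability 1/2 each."""
    return (0.5 * math.log(1.0 + (a - theta) ** 2)
            + 0.5 * math.log(1.0 + (a + theta) ** 2))

a = 2.0                                        # illustrative value with a > 1
grid = [i / 1000.0 for i in range(-5000, 5001)]
argmin = min(grid, key=lambda t: H(t, a))      # grid point where H is smallest
gamma = math.sqrt(a * a - 1.0)                 # stationary point derived here, not in the text
print(abs(argmin), gamma, H(0.0, a), H(gamma, a))
```

Because $\exp[-nH(\theta)]$ concentrates where H is smallest, the two symmetric minima are what pull the estimate away from the true center 0.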

This argument is like the one used in Freedman and Diaconis (1982b) to show 
inconsistency of M-estimators. It has been objected that the counterexample 
density h is not in the “support” of the Dirichlet, since the Dirichlet chooses a 
discrete measure with probability 1. A technical response is that the Dirichlet 
assigns positive mass to every open set of probabilities, so h is in the support—the 
smallest closed set of full prior mass. A broader response is that the Dirichlet 
assigns zero mass to any particular probability. After the fact, therefore, any 
troublesome probability can be differentiated in some qualitative way from a 
Borel set of full prior mass. 

The foregoing discussion has all been asymptotic. What are the implications 
with finite samples? Consider an estimation problem with a large amount of data 
—and a large number of parameters. A Bayes rule will do well for most 
parameter values, “most” being defined relative to the prior measure itself. On 
the other hand, a set of parameters which is large as judged by one measure may 
be quite small as judged by another, so Bayes rules may be quite unsatisfactory 
for a practical frequentist. Bayesians usually argue that the data will swamp the 
prior—but this may not happen in high-dimensional inference problems, or it 
may occur, but very slowly. Our view is that the oscillation at infinity will even 
show up with large, finite samples—in high-dimensional problems. 
On the other hand, even the most dedicated subjectivist will usually not insist
on quantifying all the precise details of a prior opinion in a high-dimensional 
situation. Our results indicate that small changes in the details of a prior can lead 
to Bayes rules with much better operating characteristics. 

Quantifying these ideas is hard and that is why we give asymptotic results. We 
hope to study the finite-sample problem elsewhere, and in particular, we hope to 
quantify the extent to which small changes in the prior can make small changes 
in the posterior but big improvements in rates of convergence. The derivative of 
the posterior with respect to the prior is a relevant concept and will be discussed 
in Section 3.


2. Other connections between Bayesians and frequentists. Frequentists 
often discuss Bayes rules for the following reasons. First, the complete class 
theorem, as in Wald (1950), Le Cam (1955), Stein (1955), Sacks (1963), or Brown 
(1981), implies that all admissible procedures are approximately Bayes. Similarly, 
all minimax procedures are approximately Bayes. So not much is lost by confin- 
ing attention to Bayes procedures. Thus, Bayes procedures are convenient, 
tractable, and close to optimal. For instance, in some problems confidence sets are 
difficult to obtain. Welch and Peers (1963), Hinkley (1980), and Stein (1981) 
suggest using regions of high posterior mass, the prior being chosen so that 
Bayesian and frequentist coverage probabilities agree to several terms in an 
asymptotic expansion. Similarly, Bayesian techniques are used as a way of 
eliminating nuisance parameters. Berk (1970) gives examples of this in sequential 
analysis. 

The common thread is a kind of pragmatic use of Bayes methods by frequen- 
tists: theory and convenience suggest the use of Bayes rules. As long as a Bayes 
rule is to be used, one may as well work with a prior that concentrates on a 
plausible part of the parameter space. 

Second, Bayesian techniques can be helpful in proving frequentist theorems. 
For example, consider estimating the mean of a univariate normal with known 
scale and squared error loss. Any admissible estimator must be an analytic 
function of the observations. The only available proof uses complete class 
theorems of Sacks (1963) and Stein (1955) to represent the estimator as a formal 
Bayes rule. It is easy to show that formal Bayes rules are suitably smooth. 
Another example of this sort is in Matthes and Truax (1967). 

Naturally, frequentists have been interested in frequentist properties of Bayes 
procedures. Theorems dating back to Laplace (1774) show that the posterior 
distribution can be approximated by the distribution of the maximum likelihood 
estimator. Modern versions of these theorems can be found in Bernstein (1934), 
Von Mises (1964), Johnson (1967, 1970), Le Cam (1982), or Ghosh et al. (1982). 
Bayesians use these results to show that standard frequentist procedures are 
nearly Bayes. See Lindley (1965). 

There are many other points of contact between the frequentist and Bayesian 
schools. However, there is often some incompatibility between Bayes rules and 
frequentist desiderata. For example, Blackwell and Girshick (1954) followed by
Blackwell and Bickel (1967) and Noorbaloochi and Meeden (1983) have shown
that there are essentially no unbiased Bayes procedures. Other points of difference
are emphasized in survey articles by Pratt (1965), Cornfield (1969),
Lindley (1972), and Neyman (1977). Savage (1972) gives a review.

Sufficiency gives a point of technical contact. Kolmogorov (1942) introduced 
the notion of Bayesian sufficiency—a statistic is Bayes sufficient if for any prior 
the posterior only depends on the data through the sufficient statistic. For 
smooth, finite-dimensional problems, Bayesian sufficiency is the same as the 
frequentist concept. Recently, Blackwell and Ramamoorthi (1982) gave an
infinite-dimensional example where the two notions disagree.

Points of agreement arise in Bayesian discussions of robustness as in Berger 
(1984) or Kadane and Chuang (1978). Here frequentist properties of Bayes 
procedures are used as a means of protection from naive specification of the prior. 
A similar compromise was suggested by Hodges and Lehmann (1952). 


3. Bayesian interpretations of consistency. It is useful to separate Bayes- 
ians into two groups: we will call them “classical” and “subjectivist.” Classical 
Bayesians, like Laplace or Bayes himself, seemed to believe there is a true but 
unknown parameter which is to be estimated from data. This parameter is part 
of an objective probability model for the data. Prior opinion about the parameter 
is expressed as a probability distribution. Subjective Bayesians like de Finetti 
and Savage reject such ideas; for them, probabilities represent degrees of belief 
and there are no objective probability models. (Freedman used to be a classical 
Bayesian, while Diaconis is a subjectivist.) 

Consistency properties of Bayes rules are clearly of interest to classical 
Bayesians; as data comes in, the posterior should converge to point mass at the 
true parameter. We will now argue that frequency properties of Bayes rules are 
also of interest to subjectivists. The first reason has to do with “intersubjective 
agreement.” In some circumstances, Bayesians learn from experience, so opinions 
based on very different priors will merge as data accumulates; the data swamps 
the prior. We will now argue that consistency is equivalent to merging of 
intersubjective opinions under certain conditions. 
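A toy version of "the data swamps the prior" can be seen in coin tossing with two conjugate priors. The sketch below is an assumption-laden illustration: the Beta(50, 10) and Beta(1, 1) priors, the true chance 0.3, and the idealized head counts are all invented for the example, and it tracks only posterior means rather than the full merging of opinions discussed in the text.

```python
def posterior_mean(a, b, heads, n):
    """Posterior mean of theta for a Beta(a, b) prior after `heads` heads in n tosses."""
    return (a + heads) / (a + b + n)

def disagreement(n, true_theta=0.3):
    """Gap between two Bayesians' posterior means after n tosses."""
    heads = round(true_theta * n)                     # idealized data: the expected count
    optimist = posterior_mean(50.0, 10.0, heads, n)   # prior mean 5/6
    sceptic = posterior_mean(1.0, 1.0, heads, n)      # uniform prior
    return abs(optimist - sceptic)

gaps = {n: disagreement(n) for n in (10, 100, 1000, 10000)}
print(gaps)
```

The gap shrinks roughly like 1/n here because both priors give positive mass to every interval around the true chance.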

A general result of this type was provided by Blackwell and Dubins (1962). To
state their result, assume that P and Q are probabilities governing a process
$(X_1, X_2, \ldots)$. Let $P_n$ and $Q_n$ be regular conditional probabilities for the future
$(X_{n+1}, X_{n+2}, \ldots)$ given the past $(X_1, X_2, \ldots, X_n)$. We think of P and Q as the
priors of two Bayesians, and assume P and Q are mutually absolutely continuous:
$P \equiv Q$, so the Bayesians agree on what is possible or impossible. Blackwell and
Dubins show that almost surely, $P_n$ and $Q_n$ merge in variation distance.

Reverting to the context of Section 1, assume that $\theta \to Q_\theta$ is a homeomorphism
and consider two Bayesians with priors $\mu$ and $\nu$, respectively. Now $\mu \equiv \nu$ if
and only if $P_\mu \equiv P_\nu$, and $\mu \perp \nu$ if and only if $P_\mu \perp P_\nu$. So the Blackwell-Dubins
result does not apply when $\mu \perp \nu$, as is usually the case with Dirichlet priors.
However, even if merging in variation distance does not happen, "weak-star
merging" is still a possibility. We say two sequences $\{\alpha_n\}$, $\{\beta_n\}$ of probabilities
merge weak-star if and only if $\alpha_n$ and $\beta_n$ become indistinguishable from the
point of view of integrating bounded continuous functions: more precisely,
$R(\alpha_n, \beta_n) \to 0$ for every weak-star continuous function R defined on pairs of
probabilities satisfying $R(\alpha, \alpha) = 0$. As an example, R might be a metric for
weak-star convergence; or $R(\alpha, \beta)$ might be $\int f\,d\alpha - \int f\,d\beta$ for a bounded
continuous function f.

We show that the version $\mu_n$ is consistent if and only if $P_\mu$ and $P_\nu$ merge
weak-star for any $\nu$, in the following sense. Recall that $P_\mu$ is the joint distribution
of $\theta$ and $X_1, X_2, \ldots$. Let $P_{\mu n}$ be the law of $X_{n+1}, X_{n+2}, \ldots$, given $X_1, \ldots, X_n$,
determined by the formula

$$P_{\mu n} = \int Q_\theta^\infty\,\mu_n(d\theta).$$

A Bayesian with prior $\mu$ holds the opinion $P_{\mu n}$ about the future $X_{n+1}, X_{n+2}, \ldots$
after seeing the data $X_1, \ldots, X_n$.


THEOREM 3. Suppose $\theta \to P_\theta$ is continuous, one-to-one, and continuously
invertible. Suppose the version $\mu_n$ is consistent. Let $\nu$ be any other prior. As
$n \to \infty$, $P_{\mu n}$ and $P_{\nu n}$ merge weak-star along $P_\nu$-almost all sequences. Conversely,
if $P_{\mu n}$ and $P_{\nu n}$ merge for all $\nu$, then $\mu_n$ is consistent.


Informally, the prior $\mu$ is consistent if and only if any other subjectivist with
prior $\nu$ is sure that $P_\mu$ and $P_\nu$ will merge on the future, as more and more data
comes in. Theorem 3 is proved in Appendix A. Similar issues were considered by
Lockett (1971).

This completes our discussion of “intersubjective agreement.” We turn to our 
second reason for thinking that frequency properties should interest subjectivists 
as a way of specifying priors. The idea is simple: after specifying a prior 
distribution, generate imaginary data sequences, compute the posterior, and 
consider whether the posterior would be an adequate representation of the 
updated prior. This is quite close to the checks for coherence proposed by 
de Finetti and Savage; see de Finetti (1974, pages 229-246) for examples and 
Savage (1971) for a review. This use of fictitious samples was suggested by the 
“device of imaginary results;” Good (1950, page 35) uses this device as a method 
of roughly quantifying a prior in difficult situations. We call it the “what if” 
method: what if the data came out that way? 

Take the prior suggested by Dalal (1979b) for use in symmetric location 
problems as discussed in Section 1. If data were generated from the density h of 
Theorem 2, the posterior would oscillate and never converge to the center of 
symmetry. The “what if” method strongly suggests modifying the prior. We give
further examples and discussion in Diaconis and Freedman (1983). 

The “what if” method suggests various interesting mathematical problems. 
For example, to what extent do the Bayes estimates determine the prior? 
Diaconis and Ylvisaker (1979) show that the posterior mean of the natural 
parameter in an exponential family determines the prior under suitable regularity 
conditions. Diaconis and Ylvisaker (1985) give counterexamples in location prob- 
lems. 

For a fixed observation x, Bayes theorem gives the relationship between the 
posterior and the prior. This defines a map from measures to measures. In the 
next theorem we calculate the derivative of this map and the norm of
the derivative. This helps to identify data sets x where small changes in the prior 
cause large changes in the posterior. These may be the most informative x’s for 
the “what if” method. (Of course, they may also be unlikely x’s.) 

In the theorem, probabilities are considered as a subset of all signed measures,
with distance defined by the variation norm: the distance between $\mu$ and $\nu$ is the
total mass of the signed measure $\mu - \nu$. Equivalently,

$$\|\mu - \nu\| = \int \left| \frac{d\mu}{d\sigma} - \frac{d\nu}{d\sigma} \right| d\sigma,$$

where $\sigma$ is any dominating $\sigma$-finite measure, e.g., $\sigma = \mu + \nu$. With this norm, the
measures form a Banach space.

Suppose $\{Q_\theta: \theta \in \Theta\}$ is a “dominated family;” there exists a $\sigma$-finite measure
$\lambda$ on $\mathcal{X}$ with all the $Q_\theta$'s absolutely continuous with respect to $\lambda$. Let $f(x|\theta)$ be
the density of $Q_\theta$ with respect to $\lambda$: we write $dQ_\theta = f(\cdot|\theta)\,d\lambda$. We will assume
that $f$ is jointly measurable and $\sup_\theta f(x|\theta) < \infty$ for every $x$. For a probability $\mu$
on $\Theta$, define

$$T(\mu) = N(\mu)/D(\mu),$$

where

$$N(\mu)(d\theta) = f(x|\theta)\,\mu(d\theta) \quad\text{and}\quad D(\mu) = \int_\Theta f(x|\theta)\,\mu(d\theta).$$

Thus, $N(\mu)$ is a measure and $D(\mu)$ is a number; $T(\mu)$ is the posterior distribution.
We confine ourselves to the $x$ with $D(\mu) > 0$, a set of $P_\mu$-measure 1. The
dependence on $x$ is suppressed for the moment. The map $\mu \to T(\mu)$ takes priors
into posteriors. It has a derivative $\dot T_\mu$ at $\mu$. This $\dot T_\mu$ is a linear map on signed
measures such that

$$T(\mu + \delta) = T(\mu) + \dot T_\mu(\delta) + o(\|\delta\|) \quad\text{as } \|\delta\| \to 0,$$

where $\delta$ is a signed measure with signed mass 0. The norm of $\dot T_\mu$ will be used as a
measure of the effect of a small change in $\mu$. This is defined by

$$\|\dot T_\mu\| = \sup_{\|\delta\| \le 1} \|\dot T_\mu(\delta)\|.$$

Let $\sup_\mu f(x|\theta) = \sup_\theta \{ f(x|\theta): \mu\{\theta\} = 0 \}$, the sup over an empty set being 0 by
convention. Under the given conditions:


THEOREM 4. (a) $\dot T_\mu(\delta) = [N(\delta)/D(\mu)] - [N(\mu)D(\delta)/D(\mu)^2]$.
(b) $\|\dot T_\mu\| \le \sup_\theta f(x|\theta)/D(\mu)$.
(c) $\|\dot T_\mu\| \ge \sup_\mu f(x|\theta)/D(\mu)$.


Theorem 4 is proved in Appendix B. In (a), the quantity $N(\delta)$ is a signed
measure, normalized by $D(\mu)$. Likewise, $N(\mu)$ is a signed measure, normalized by
the factor $D(\delta)/D(\mu)^2$. Thus, $\dot T_\mu(\delta)$ is a signed measure. As is easily seen, this
measure has signed mass 0. For many priors the upper and lower bounds of (b)
and (c) coincide. Then conclusion (b) has a simple interpretation in terms of
likelihood ratios: the x’s where the posterior is most sensitive to small changes in
the prior are the x’s which have high ratio of objectivist likelihood to subjectivist
likelihood. These x’s are the ones where the “what if” method will be most
informative: at such x, small changes in the prior can make big changes in the
posterior.
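
The formula in part (a) and the bounds in (b) and (c) can be checked numerically on a finite parameter set. The sketch below is only an illustration under assumptions: the four likelihood values, the prior, and all function names are hypothetical, and measures are represented as arrays of point masses.

```python
import numpy as np

def posterior(mu, f):
    """Bayes map T(mu) = N(mu) / D(mu) on a finite set of theta values."""
    return mu * f / np.dot(mu, f)

def derivative(mu, f, delta):
    """Theorem 4(a): N(delta)/D(mu) - N(mu) D(delta) / D(mu)^2."""
    d_mu = np.dot(mu, f)
    return delta * f / d_mu - mu * f * np.dot(delta, f) / d_mu ** 2

def tv_norm(signed):
    """Variation norm of a signed measure on a finite set."""
    return float(np.abs(signed).sum())

f = np.array([0.05, 0.20, 0.40, 0.90])    # hypothetical values of f(x|theta)
mu = np.array([0.5, 0.3, 0.2, 0.0])       # prior giving mass 0 to the last theta
upper = f.max() / np.dot(mu, f)           # right side of (b)
lower = f[mu == 0].max() / np.dot(mu, f)  # right side of (c)

# Direction used in the proof of (c): half of (point mass at theta_0 minus mu).
delta = 0.5 * (np.array([0.0, 0.0, 0.0, 1.0]) - mu)
attained = tv_norm(derivative(mu, f, delta))

def remainder(eps):
    """Variation norm of T(mu + eps*delta) - T(mu) - derivative at eps*delta."""
    step = eps * delta
    return tv_norm(posterior(mu + step, f) - posterior(mu, f)
                   - derivative(mu, f, step))
```

In this example the largest likelihood value sits at a point of prior mass zero, so the two bounds coincide and `attained` equals both; `remainder(eps)` shrinks like the square of `eps`, as a derivative requires.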

As an example, consider a normal location problem with known variance $\sigma^2$.
Without loss of generality, suppose there is only one observation. Let $\mu$ be a normal
prior for the location parameter $\theta$. Suppose that $\mu$ has mean $\mu_0$ and variance $\sigma_0^2$. Then
$D(\mu)$ at $x$ turns out to be the normal density at $x$ with mean $\mu_0$ and variance
$\sigma_0^2 + \sigma^2$. The norm computed in (b) is


$$\left[(\sigma^2 + \sigma_0^2)/\sigma^2\right]^{1/2} \exp\left\{ \tfrac{1}{2}(x - \mu_0)^2/(\sigma^2 + \sigma_0^2) \right\}.$$


This is large when x is far from po. We have carried out similar computations for 
the other standard one-dimensional exponential families with conjugate priors. 
The results are similar: values of x far from the mean of the prior lead to large 
norms. 
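
A minimal sketch of the normal computation, assuming the bound in (b) is evaluated as the maximized likelihood divided by the marginal density $D(\mu)$. The function names and the numerical inputs are hypothetical; the closed form is the ratio of a normal density with variance $\sigma^2$ at its mode to a normal density with variance $\sigma^2 + \sigma_0^2$ at $x$.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def sensitivity_norm(x, mu0, var0, var):
    """sup over theta of f(x|theta), divided by the marginal density D(mu) at x."""
    sup_f = normal_pdf(x, x, var)             # likelihood is maximized at theta = x
    marginal = normal_pdf(x, mu0, var0 + var)
    return sup_f / marginal

def closed_form(x, mu0, var0, var):
    return math.sqrt((var + var0) / var) * math.exp(
        0.5 * (x - mu0) ** 2 / (var + var0))

# Hypothetical inputs: prior N(0, 4), sampling variance 1.
near = sensitivity_norm(0.5, 0.0, 4.0, 1.0)
far = sensitivity_norm(6.0, 0.0, 4.0, 1.0)
```

Observations far from the prior mean give the larger value, in line with the remark above.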

On the other hand, for examples of the type considered in Freedman and
Diaconis (1983), the posterior concentrates at some distance from the maximum
likelihood estimate, so $\int f\,d\mu$ will be many orders of magnitude smaller than $\max f$
for most data sets, relative to the true sampling distribution. Such data sets will
have high leverage, and a posterior which may seem on reflection to be
unsatisfactory, so the “what if” method might prompt revision of an inconsistent prior.
This completes our discussion of the “what if” method and with it the Bayesian
defense of frequentist analysis.


4. Conclusion. We return now to the big picture. There is a probability 
model for data and some of the parameters are to be estimated. A statistician 
who really has a sharp prior probability distribution for these parameters should 
use it, according to Bayes theorem; inconsistency on a null set of a priori 
probability zero is an irrelevant nuisance. On this point, there seems to be general 
agreement in the statistical community. 

Often, a statistician has prior information about a problem (say as to the 
rough order of magnitude of a key parameter), but does not really have a sharply 
defined prior probability distribution. Many different distributions would have 
the right qualitative features and a Bayesian typically chooses one on the basis of 
mathematical convenience. In smooth, low-dimensional problems, this ought to 
help, and anyway cannot lead to disaster, because the data will swamp the details 
of the prior. 

Unfortunately, in high-dimensional problems, arbitrary details of the prior can 
really matter; indeed, the prior can swamp the data, no matter how much data 
you have. That is what our examples suggest, and that is why we advise against 
the mechanical use of Bayesian nonparametric techniques. 


APPENDIX A 


Merging of posteriors and the weak-star topology. We suppose $\Theta$ is a
Borel subset of a complete separable metric space, or a Borel set for short. Next,
for each $\theta \in \Theta$, we have a probability $Q_\theta$ on the Borel subsets of another Borel
set $\mathcal{X}$. Thus, $\Theta$ is the parameter space and $\mathcal{X}$ the observation space. The map
$\theta \to Q_\theta$ is assumed to be 1-1 and Borel. For a prior probability $\mu$ on $\Theta$, we write
$P_\mu$ for the probability on $\Theta \times \mathcal{X}^\infty$ defined by


$$P_\mu(A \times B) = \int_A Q_\theta^\infty(B)\,\mu(d\theta),$$


where $A$ is Borel in $\Theta$ and $B$ is Borel in $\mathcal{X}^\infty$. As is easily verified, if $H$ is a Borel
subset of $\Theta \times \mathcal{X}^\infty$, then


$$\text{(A.1)}\qquad P_\mu(H) = \int_\Theta (\delta_\theta \times Q_\theta^\infty)(H)\,\mu(d\theta).$$


As usual, $\delta_\theta$ is point mass at $\theta$.

Fix a version $\mu_n = \mu_n(d\theta|x_1, \ldots, x_n)$ of the $P_\mu$-law of $\theta$ given $X_1 = x_1, \ldots,$
$X_n = x_n$. Here, as elsewhere, $\theta$ denotes the coordinate map

$$(\theta, x_1, x_2, \ldots) \to \theta.$$

The posterior $\mu_n$ may be considered as a function on $\Theta \times \mathcal{X}^\infty$, or just $\mathcal{X}^\infty$, or
even $\mathcal{X}^n$, according to convenience.

Consistency of $(\theta, \mu_n)$ depends not only on $\mu$, but also on the choice of the
version $\mu_n$, which is well defined only a.e. An artificial example may clarify the
point.


EXAMPLE A.1. Let $\Theta = [0,1]$, $\mathcal{X} = [0,1]$, $Q_\theta = \delta_\theta$, and let $\mu$ be Lebesgue
measure on $\Theta$. Let
$$\mu_n(d\theta|x_1, \ldots, x_n) = \delta_{x_1}$$
unless $x_1 = 0$ in which case
$$\mu_n(d\theta|x_1, \ldots, x_n) = \delta_{1/2}.$$
This $\mu_n$ is a (stupid) version of the law of $\theta$ given $X_1, \ldots, X_n$ because
$$P_\mu\{X_1 = 0, \ldots, X_n = 0\} = 0.$$
For this version, $\mu_n$ is consistent except at $\theta = 0$ when $\mu_n \to \delta_{1/2}$.

Bayes estimates of an unknown probability on the line provide a less artificial
example, where $\{Q_\theta\}$ is not dominated by a $\sigma$-finite measure. Then, the posterior
can be changed on a set of measure zero, so consistency is determined for $\mu$-almost
all $\theta$ but not for all $\theta$.

The next topic is the weak-star topology for probability measures on $\Theta$. Recall
that $\Theta$ is a separable metric space by assumption, but $\Theta$ need not be complete, or
compact, or locally compact: for example, $\Theta$ might be the irrational numbers in
the unit interval. Let $\mathcal{C}$ be the set of bounded, continuous functions on $\Theta$. If $\alpha_n$
and $\alpha$ are probabilities on $\Theta$, then $\alpha_n \to \alpha$ weak-star if and only if $\int f\,d\alpha_n \to \int f\,d\alpha$
for all $f \in \mathcal{C}$.
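A technical problem is that $\mathcal{C}$ will not usually be separable. However, it is
enough to consider only the uniformly continuous $f \in \mathcal{C}$: to be specific, let $\Theta_0$ be
a countable dense subset of $\Theta$. For each $\theta_0 \in \Theta_0$, and nonnegative rationals $r < s$
and arbitrary rationals $u, v$, define a function $f = f_{\theta_0, r, s, u, v}$ on $\Theta$ as follows:

$$f(\theta) = \begin{cases} u & \text{for } \theta \text{ with } \rho(\theta_0, \theta) \le r, \\ v & \text{for } \theta \text{ with } \rho(\theta_0, \theta) \ge s, \end{cases}$$

while

$$f(\theta) = u\,\frac{s - \rho(\theta_0, \theta)}{s - r} + v\,\frac{\rho(\theta_0, \theta) - r}{s - r}$$

for $\theta$ with $r < \rho(\theta_0, \theta) < s$. Here, $\rho$ is the given metric on $\Theta$. Clearly, $f$ is
bounded and uniformly continuous. Let $\mathcal{C}_0 = \{f\}$, a countable collection. Let
$\mathcal{C}_\vee = \{f_1 \vee \cdots \vee f_k$: for $f_i \in \mathcal{C}_0$ and $k = 1, 2, \ldots\}$. Let $\mathcal{C}_\wedge = \{f_1 \wedge \cdots \wedge f_k$: for
$f_i \in \mathcal{C}_0$ and $k = 1, 2, \ldots\}$. Here $\vee$ and $\wedge$ are pointwise max and min, respectively.
Now $\mathcal{C}_\vee$ and $\mathcal{C}_\wedge$ are countable subsets of $\mathcal{C}$, and pointwise dense, as
follows.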


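LEMMA A.1. If $f \in \mathcal{C}$, there is a sequence $f_n \in \mathcal{C}_\vee$ with $f_n \uparrow f$ pointwise;
and another sequence $g_n \in \mathcal{C}_\wedge$ with $g_n \downarrow f$ pointwise.


PROOF. Use the method of exhaustion: for instance, if $f \in \mathcal{C}$, then $f = \inf\{g$:
$g \in \mathcal{C}_\wedge$ and $g \ge f\}$. □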
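
The building blocks of this construction are easy to write down. The sketch below is an illustration only: it assumes the parameter space is the real line with the usual metric, and the names `bump`, `pointwise_max`, and `pointwise_min` are hypothetical. Each function is constant within distance r of a center, constant beyond distance s, and linear in the distance in between.

```python
def bump(theta0, r, s, u, v, rho=lambda a, b: abs(a - b)):
    """Return the function equal to u near theta0, v far away, linear between.

    rho is the metric; it defaults to the usual distance on the line.
    """
    def f(theta):
        d = rho(theta0, theta)
        if d <= r:
            return u
        if d >= s:
            return v
        return u * (s - d) / (s - r) + v * (d - r) / (s - r)
    return f

def pointwise_max(*fs):
    """Finite pointwise maximum of functions."""
    return lambda theta: max(g(theta) for g in fs)

def pointwise_min(*fs):
    """Finite pointwise minimum of functions."""
    return lambda theta: min(g(theta) for g in fs)

f1 = bump(0.0, 1.0, 2.0, 3.0, -1.0)
f2 = bump(4.0, 0.5, 1.0, 2.0, 0.0)
upper_envelope = pointwise_max(f1, f2)
```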


COROLLARY A.1. If $\int f\,d\alpha_n \to \int f\,d\alpha$ for all $f \in \mathcal{C}_\vee \cup \mathcal{C}_\wedge$, then $\alpha_n \to \alpha$
weak-star.


We can now prove Doob’s theorem in our context.

COROLLARY A.2. $\mu_n \to \delta_\theta$ weak-star, $P_\mu$-almost surely.


PROOF. Fix an $f \in \mathcal{C}_\vee \cup \mathcal{C}_\wedge$. By Corollary A.1 it is enough to prove that
$\int f\,d\mu_n \to f(\theta)$ a.e. $P_\mu$. But $\int f\,d\mu_n$ is a version of $E\{f(\theta)|X_1 \ldots X_n\}$, so $\int f\,d\mu_n \to$
$E\{f(\theta)|X_1, X_2, \ldots\}$ a.e. by martingales. The last step is to prove identifiability:
that $\theta$ can be computed measurably from all the X’s, a.e. $P_\mu$.

The identifiability argument will only be sketched: it is at this point that
“1-1-ness” matters. Since $\theta \to Q_\theta$ is 1-1 and Borel, so is the inverse: see
Kuratowski (1958, Section 35.5). So we only need to compute $Q_\theta$ from $X_1, X_2, \ldots$.
Using Corollary A.1 again, we only need to compute $\int g\,dQ_\theta$ for $g \in \mathcal{C}_\vee \cup \mathcal{C}_\wedge$.
But


$$\text{(A.2)}\qquad \int g\,dQ_\theta = \lim_{n \to \infty} \frac{1}{n}\left[g(X_1) + \cdots + g(X_n)\right]$$


a.e. by the law of large numbers.
In a bit more detail, let $H_g$ be the subset of $\Theta \times \mathcal{X}^\infty$ where (A.2) holds. Then
$H_g$ is Borel, and $(\delta_\theta \times Q_\theta^\infty)(H_g) = 1$ for all $\theta$ by the law of large numbers. So
$P_\mu(H_g) = 1$ by (A.1). □


The following lemma and corollary are used later. 


LEMMA A.2. A Borel set is homeomorphic to a dense Borel subset of a
compact metric set.


PROOF. This standard fact can be proved by appealing to Urysohn’s lemma
(Kuratowski, 1958, page 119) and then (Kuratowski, 1958, page 397). □


COROLLARY A.3. $\theta \to Q_\theta^\infty$ is weak-star continuous.

PROOF. Let $f$ be bounded continuous on $\mathcal{X}^\infty$ and
$$g(\theta) = \int_{\mathcal{X}^\infty} f\,dQ_\theta^\infty.$$
We have to show that $g$ is continuous.


CASE 1. $\mathcal{X}$ is compact. Then $f$ can be approximated by finite linear combinations
of functions $\prod_{i=1}^k f_i(x_i)$ with $f_i$ bounded continuous on $\mathcal{X}$.


THE GENERAL CASE. Embed $\mathcal{X}$ as a dense Borel subset of the compact
metric space $(\bar{\mathcal{X}}, \rho)$ and metrize $\mathcal{X}$ by $\rho$: see Lemma A.2. The bounded, $\rho$-uniformly
continuous functions on $\mathcal{X}$ are those which extend to continuous functions
on $\bar{\mathcal{X}}$. Now use Corollary A.1 to reduce the general case to Case 1, the point
being that a function $g$ which is both upper and lower semicontinuous is
continuous. □


The next topic is “weak-star merging.” Let $\alpha_n$ and $\beta_n$ be probabilities on $\Theta$;
the sequences $\{\alpha_n\}$ and $\{\beta_n\}$ merge weak-star if and only if $R(\alpha_n, \beta_n) \to 0$ for all
continuous functions $R$ which satisfy $R(\alpha, \alpha) = 0$. Some possible choices for $R$:


$R(\alpha, \beta) = \int f\,d\alpha - \int f\,d\beta$ for a fixed bounded continuous $f$,

(A.3) $R(\alpha, \beta) = \lambda(\alpha, \beta)$ with $\lambda$ a metric for the weak-star topology,

$R(\alpha, \beta) = T(\alpha) - T(\beta)$ for a fixed bounded continuous function $T$.

For the next theorem, we also need


$$P_{\mu n} = \int Q_\theta^\infty\,\mu_n(d\theta).$$


This is the “predictive” distribution of $X_{n+1}, X_{n+2}, \ldots$, given $X_1, \ldots, X_n$. Of
course, $P_{\mu n} = P_{\mu n}(\cdot|x_1 \ldots x_n)$ is a probability on $\mathcal{X}^\infty$, depending on the first $n$
data points.
We will prove somewhat more than Theorem 3. To state the result, let
$G \subset \Theta \times \mathcal{X}^\infty$ be the set of $(\theta, x_1, x_2, \ldots)$ such that $\mu_n(d\theta|x_1 \ldots x_n) \to \delta_\theta$ weak-star.
Notice that $G$ depends on $\mu_n$ and $\mu_n$ is consistent if and only if
$(\delta_\theta \times Q_\theta^\infty)(G) = 1$ for all $\theta$.


THEOREM A.1. Let $\theta \to Q_\theta$ be 1-1 and Borel. Fix a prior $\mu$ on $\Theta$ and a
version $\mu_n$ of the posterior. The following three conditions are equivalent.

(i) $\mu_n$ is consistent.

(ii) $P_\nu(G) = 1$ for all probabilities $\nu$ on $\Theta$.

(iii) $\mu_n(d\theta|x_1 \ldots x_n)$ and $\nu_n(d\theta|x_1 \ldots x_n)$ merge weak-star as $n \to \infty$, with $P_\nu$
probability one, for all probabilities $\nu$ on $\Theta$.

Suppose that $\theta \to Q_\theta$ is continuous and has a continuous inverse. Then (i) is
also equivalent to

(iv) $P_{\mu n}$ and $P_{\nu n}$ merge weak-star as $n \to \infty$, with $P_\nu$ probability one, for all
probabilities $\nu$ on $\Theta$.


PROOF. Fix $\nu$. We will prove that (i) ⇒ (ii) ⇒ (iii). The first implication is
trivial: from (A.1),


$$\text{(A.4)}\qquad P_\nu(G) = \int (\delta_\theta \times Q_\theta^\infty)(G)\,\nu(d\theta)$$


and $(\delta_\theta \times Q_\theta^\infty)(G) = 1$ for all $\theta$ from the definition of consistency. For the next
implication, suppose (ii) for $\nu$. We will argue that a.e. $P_\nu$,


(A.5) $\mu_n \to \delta_\theta$,
(A.6) $\nu_n \to \delta_\theta$.
The notation may be a bit confusing: $\theta$ is being used for the coordinate function
$(\theta, x_1, x_2, \ldots) \to \theta$ and its value $\theta$. It would follow from (A.5-A.6) that $R(\mu_n, \nu_n)$
$\to 0$. Now (A.5) holds on $G$, which has $P_\nu$-probability 1, by assumption. And (A.6)
holds a.e. $P_\nu$ by Doob’s theorem, Corollary A.2 above. Thus, (i) ⇒ (ii) ⇒ (iii).
Clearly, (iii) ⇒ (i): take $\nu$ to be point mass and $R$ as in (A.3).

Now assume that $\theta \to Q_\theta$ is 1-1 and continuous. We will argue (i) ⇒ (iv).
Indeed, suppose (i). We will show that a.e. $P_\nu$
(A.7) $P_{\mu n} \to Q_\theta^\infty$,
(A.8) $P_{\nu n} \to Q_\theta^\infty$.
For (A.7), let $f$ be bounded and continuous on $\mathcal{X}^\infty$. By definition,


$$\int_{\mathcal{X}^\infty} f\,dP_{\mu n} = \int_\Theta \int_{\mathcal{X}^\infty} f\,dQ_\theta^\infty\,\mu_n(d\theta).$$


Now $g$: $\theta \to \int f\,dQ_\theta^\infty$ is a bounded, continuous function on $\Theta$, in view of Corollary
A.3. In view of (i), $\int g\,d\mu_n \to g(\theta)$ a.e. $Q_\theta^\infty$ for all $\theta$. Thus, $\int g\,d\mu_n \to g(\theta)$ a.e. $P_\nu$.
This proves (A.7) via Corollary A.1, letting $f$ run through the analogs of $\mathcal{C}_\vee$ and
$\mathcal{C}_\wedge$ on $\mathcal{X}^\infty$. For (A.8), by Doob’s theorem, $\int f\,dP_{\nu n} \to g(\theta)$ a.e. $P_\nu$, and the rest of
the argument is the same. Thus, (i) ⇒ (iv) in the presence of continuity.
Finally, assume (iv), and let $\nu = \delta_\theta$ so $P_\nu = \delta_\theta \times Q_\theta^\infty$ and $P_{\nu n} = Q_\theta^\infty$ a.e. $Q_\theta^\infty$.
Also, $P_{\mu n} = \int Q_\phi^\infty\,\mu_n(d\phi) \to Q_\theta^\infty$ a.e. $Q_\theta^\infty$ by (iv). Thus $U(\mu_n) \to U(\delta_\theta)$ a.e. $Q_\theta^\infty$
and $\mu_n \to \delta_\theta$ a.e. $Q_\theta^\infty$, by Proposition A.1 below, where $U$ is defined. □


REMARKS. (1) Consider a Bayesian with prior $\mu$ who chooses $\mu_n$ as the
posterior. Consider a second Bayesian with the generic prior $\nu$. Now $G$ is the set
where the first Bayesian gets the parameters $\theta$ right. Condition (ii) in the
theorem is that any second Bayesian will be sure that the first Bayesian gets $\theta$
right, whatever $\theta$ may be. Condition (iii) is that any second Bayesian is sure
that his posterior will merge weak-star with that of the first Bayesian. (This is
well defined, since $P_\nu$-null sets do not matter here: any version $\nu_n$ can be used.)
Condition (iv) is that any second Bayesian is sure that his conditional opinion of
the future given the past will merge weak-star with that of the first Bayesian.

(2) The implications (iii) ⇒ (i) or (iv) ⇒ (i) are valid with much weaker
notions of merging. Essentially, only one $R$ is needed, provided $R$ vanishes on
the diagonal and is positive off the diagonal. Subfamilies of functions $R$ lead to
different notions of merging:

(i) $\int f\,d\alpha_n - \int f\,d\beta_n \to 0$ for every bounded continuous $f$.

(ii) $\lambda(\alpha_n, \beta_n) \to 0$ for $\lambda$ metrizing the weak star topology.

(iii) $T(\alpha_n) - T(\beta_n) \to 0$ for every continuous function on the probabilities.
These notions are all different, even on R. For example, let $\alpha_n = \delta_n$, $\beta_n = \delta_{n + 1/n}$
with $\delta_x$ a point mass at $x$. Then $\lambda(\alpha_n, \beta_n) \to 0$ for Prohorov’s metric, but there is
a bounded continuous $f$ with $\int f\,d\alpha_n = 1$, $\int f\,d\beta_n = 0$. Thus (ii) is different from
(i). To see that (iii) is different from (ii), take $\alpha_n = \delta_n$, $\beta_n = (1 - (1/n))\delta_n +$
$(1/n)\delta_{n+1}$. Take $T(\alpha) = \alpha(1) + \alpha(2)^2 + \cdots + \alpha(n)^n + \cdots$. Now $T(\alpha_n) = 1$ but
$T(\beta_n) \to 1/e$.

The notion of merging we use implies merging in the above senses and is
equivalent to merging in the finest uniformity compatible with the weak star
topology (see Kelley, 1955, Chapter 6). For further discussion of these issues, see
Diaconis and Freedman (1984) and Dudley (1966, 1968).
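
The last example in Remark (2) can be checked by direct arithmetic. In the sketch below a measure on the positive integers is represented as a dictionary from points to masses; this representation and the function names are assumptions made for the illustration.

```python
import math

def T(alpha):
    """T(alpha) = sum over k of alpha{k}^k, for alpha given as {integer k: mass}."""
    return sum(mass ** k for k, mass in alpha.items())

def alpha_n(n):
    """Point mass at n."""
    return {n: 1.0}

def beta_n(n):
    """Mass 1 - 1/n at n and mass 1/n at n + 1."""
    return {n: 1.0 - 1.0 / n, n + 1: 1.0 / n}

# T(beta_n) = (1 - 1/n)^n + (1/n)^(n+1), which approaches 1/e as n grows.
values = [(n, T(alpha_n(n)), T(beta_n(n))) for n in (10, 100)]
limit = 1.0 / math.e
```

So $T(\alpha_n) - T(\beta_n)$ stays near $1 - 1/e$, even though the two sequences put all but mass $1/n$ at the same point.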

(3) The continuity conditions are needed to conclude (iv) from (i). To see this,
take $\Theta = [0,1]$ and $\mathcal{X} = \{0,1\}$ and let


30 for 8 e (0, 4), 
f(@) =(9 for 9 = 1, 
4426 ford e(4,1]. 


Let $Q_\theta$ be $f(\theta)$-coin tossing. Let $\mu$ be the uniform distribution on $\Theta$.
Straightforward analysis shows that $\mu$ is consistent, but the predictive distribution $P_{\mu n}$ does
not merge with $P_{\nu n}$ when $\nu = \delta_{1/2}$. Lockett (1971) gives a similar example
involving geometric variables.

(4) The continuity conditions are needed to conclude (i) from (iv). To see this,
take $\Theta = [0, 2\pi)$ and $\mathcal{X}$ as the unit circle. Let $F(\theta) = e^{i\theta}$ map $\Theta$ onto $\mathcal{X}$. This $F$
is 1-1 and continuous, but does not have a continuous inverse at $\xi = (1,0) \in \mathcal{X}$.
Define $Q_\theta = \delta_{F(\theta)}$. Let $\mu$ be uniform on $\Theta$. Define a posterior, maliciously, as

$$\mu_n(d\theta|x_1, \ldots, x_n) = \begin{cases} \delta_{F^{-1}(x_1)} & \text{if } x_1 \ne \xi, \\ \delta_{2\pi - 1/n} & \text{if } x_1 = \xi. \end{cases}$$


So $(\theta, \mu_n)$ is not consistent. But $P_{\mu n} \to Q_\theta^\infty$ a.e. $Q_\theta^\infty$, for all $\theta$, even $\theta = 0$, and so
merges with $P_{\nu n}$ a.e. $P_\nu$ for any $\nu$.

The following is required to complete the proof of Theorem A.1. If $S$ is a Borel
set, we write $\pi(S)$ for the set of probabilities on $S$, endowed with the weak-star
topology. The set $\pi(S)$ is Borel too. Let $S_0$ be a Borel subset of $S$. Let
$\mathcal{P}_0 = \{\mu: \mu \in \pi(S)$ and $\mu\{S_0\} = 1\}$. Then $\mathcal{P}_0$ is a Borel subset of $\pi(S)$. Let $M$
map $\mathcal{P}_0$ onto $\pi(S_0)$ as follows: if $\alpha \in \mathcal{P}_0$, then $M\alpha$ is the restriction of $\alpha$ to the
Borel subsets of $S_0$. Thus, $M$ maps probabilities on $S$ to probabilities on $S_0$. We
write $M = M(S, S_0)$ to show the dependence on the spaces $S$ and $S_0$.


LEMMA A.3. $M$ is a homeomorphism of $\mathcal{P}_0$ onto $\pi(S_0)$.

PROOF. Use Corollary A.1. □


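Using Lemma A.3, we can view $\mathcal{X}$ as a dense Borel subset of the compact
metric space $(\bar{\mathcal{X}}, \rho)$; metrize $\mathcal{X}$ by $\rho$. Then $\pi(\mathcal{X})$ can be viewed as a Borel subset
of the compact metric set $\pi(\bar{\mathcal{X}})$. For $\mu$ a probability on $\Theta$, let


$$U(\mu) = \int_\Theta Q_\theta^\infty\,\mu(d\theta).$$

So $U(\mu)$ is a probability on $\mathcal{X}^\infty$.

PROPOSITION A.1. $U$ is a homeomorphism of $\pi(\Theta)$ into $\pi(\mathcal{X}^\infty)$.

PROOF. Let $\bar\Theta = \pi(\bar{\mathcal{X}})$. Let $\bar U$ map $\pi(\bar\Theta)$ into $\pi(\bar{\mathcal{X}}^\infty)$ as follows:


$$\bar U(\mu) = \int \theta^\infty\,\mu(d\theta).$$


Then $\bar U$ is continuous, and 1-1 by de Finetti’s theorem (Diaconis and Freedman,
1980, Appendix). Since $\pi(\bar\Theta)$ is compact, $\bar U^{-1}$ is continuous. Next, suppose $\mathcal{X}$ is a
Borel subset of $\bar{\mathcal{X}}$, $\Theta$ is a Borel subset of $\pi(\mathcal{X})$, and $Q_\theta = \theta$. In this setting,
$U(\mu) = \int_\Theta \theta^\infty\,\mu(d\theta)$. Now apply Lemma A.3 twice, with $M_1 = M(\bar\Theta, \Theta)$ and
$M_2 = M(\bar{\mathcal{X}}^\infty, \mathcal{X}^\infty)$. Then $U = M_2 \circ \bar U \circ M_1^{-1}$ is a homeomorphism on $\pi(\Theta)$. To
see that the composition makes sense, let $\mu \in \pi(\Theta)$ and $\nu = M_1^{-1}\mu$; then $\nu \in \pi(\bar\Theta)$
and $\nu(\Theta) = 1$, so $\bar U(\nu)(\mathcal{X}^\infty) = 1$ because $\theta^\infty(\mathcal{X}^\infty) = 1$ for $\theta \in \Theta$.

The general case is almost immediate: $\theta \to Q_\theta$ being continuous and 1-1 on $\Theta$,
the image $\tilde\Theta$ under this mapping of $\Theta$ is a Borel subset of $\pi(\mathcal{X})$: see Kuratowski
(1958). Apply the previous argument with $\tilde\Theta$ in place of $\Theta$. Let $M_3$ map $\pi(\Theta)$
onto $\pi(\tilde\Theta)$ by the recipe $(M_3\nu)(A) = \nu\{\theta: Q_\theta \in A\}$ for Borel $A \subset \tilde\Theta$. This $M_3$ is
a homeomorphism because $\theta \to Q_\theta$ is. Now $U = M_2 \circ \bar U \circ M_1^{-1} \circ M_3$ is a
homeomorphism too. □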


APPENDIX B
The derivative of the posterior with respect to the prior. Proceeding
heuristically for a moment, the derivative of the ratio $T(\mu) = N(\mu)/D(\mu)$ is


$$\dot T_\mu = \left[D(\mu)\dot N_\mu - N(\mu)\dot D_\mu\right]/D(\mu)^2.$$


Now $N(\mu)$ is linear: $N(\mu + \delta) = N(\mu) + N(\delta)$. So $\dot N_\mu = N(\cdot)$. Likewise, $D(\mu)$ is a
linear functional, so $\dot D_\mu = D(\cdot)$. For example, in our sense, the derivative of the
linear function $\phi$: $x \to 3x$ is just $\phi$, because $\phi(x + h) = \phi(x) + \phi(h)$.
The upshot is $\dot T_\mu = R_\mu$, where

$$R_\mu(\cdot) = \frac{N(\cdot)}{D(\mu)} - \frac{N(\mu)}{D(\mu)^2}\,D(\cdot).$$

This is part (a) of Theorem 4. For a rigorous proof, we must show that for any $\delta$
with signed mass 0,

$$T(\mu + \delta) = T(\mu) + R_\mu(\delta) + o(\|\delta\|).$$

The difference $T(\mu + \delta) - T(\mu) - R_\mu(\delta)$ is easily seen to equal

$$-\frac{D(\delta)}{D(\mu + \delta)}\left[\frac{N(\delta)}{D(\mu)} - \frac{N(\mu)D(\delta)}{D(\mu)^2}\right].$$

The norm of this is smaller than

$$\text{(B.1)}\qquad \frac{|D(\delta)|}{D(\mu + \delta)}\left[\frac{\|N(\delta)\|}{D(\mu)} + \frac{\|N(\mu)\|\,|D(\delta)|}{D(\mu)^2}\right].$$

Let $C = \sup_\theta f(x|\theta) < \infty$ by assumption. Then $|D(\delta)| \le C\|\delta\|$ and $\|N(\delta)\| \le$
$C\|\delta\|$. Further, $D(\mu + \delta)$ tends to $D(\mu)$ as $\delta$ tends to 0. It follows that the bound
(B.1) is smaller than $C(\mu)\|\delta\|^2$ for $\|\delta\|$ small. This completes the proof of part (a).
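To prove part (b) of Theorem 4, fix $x$ and write $f$ for $\theta \to f(x|\theta)$. Let
$\bar f = f/D(\mu)$ so $\int \bar f\,d\mu = 1$. Then

$$d\dot T_\mu(\delta) = \bar f\,d\delta - \left(\int \bar f\,d\delta\right)\bar f\,d\mu.$$


Choose $\sigma$ to dominate both $|\delta|$ and $\mu$, e.g., $\sigma = |\delta| + \mu$. Then $d\delta = \dot\delta\,d\sigma$ and
$d\mu = \dot\mu\,d\sigma$ and


$$d\dot T_\mu(\delta) = \left[\dot\delta - \left(\int \bar f\,d\delta\right)\dot\mu\right]\bar f\,d\sigma.$$

We must show

$$\text{(B.2)}\qquad \|\dot T_\mu(\delta)\| \le \|\delta\| \cdot \sup \bar f.$$


Indeed,

$$\int \dot\delta\,\bar f\,d\sigma = \int \bar f\,d\delta \quad\text{and}\quad \int \dot\mu\,\bar f\,d\sigma = \int \bar f\,d\mu = 1.$$

Thus, $\dot T_\mu(\delta)$ has signed mass 0, namely, $\dot T_\mu(\delta)[\Theta] = \int \bar f\,d\delta - \int \bar f\,d\delta = 0$. It follows
that


$$\text{(B.3)}\qquad \|\dot T_\mu(\delta)\| = 2\int \left[\dot\delta - \left(\int \bar f\,d\delta\right)\dot\mu\right]^+ \bar f\,d\sigma = 2\int \left[\dot\delta - \left(\int \bar f\,d\delta\right)\dot\mu\right]^- \bar f\,d\sigma.$$


Assume without loss of generality that $\int \bar f\,d\delta \ge 0$. For $a \ge 0$ and real $d$, clearly
$(d - a)^+ \le d^+$. So


$$\|\dot T_\mu(\delta)\| \le 2\int \dot\delta^+ \bar f\,d\sigma \le 2\left(\int \dot\delta^+\,d\sigma\right)(\sup \bar f).$$

But

$$\int \dot\delta^+\,d\sigma = \tfrac{1}{2}\int |\dot\delta|\,d\sigma = \tfrac{1}{2}\|\delta\|$$


because $\int \dot\delta\,d\sigma = \int d\delta = 0$. This completes the proof of (B.2), and hence part (b) of
Theorem 4.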

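To prove part (c) of Theorem 4, we must show that for every $\varepsilon > 0$ there is a
signed measure $\delta$ with signed mass 0 and total mass 1, satisfying


$$\|\dot T_\mu(\delta)\| \ge (1 - \varepsilon)\sup\nolimits_\mu \bar f.$$


Choose $\theta_0$ with $\mu\{\theta_0\} = 0$ and $\bar f(\theta_0) > (1 - \varepsilon)\sup_\mu \bar f$. Let $\delta = \tfrac{1}{2}(\delta_{\theta_0} - \mu)$, where
$\delta_{\theta_0}$ is point mass at $\theta_0$. Let $\sigma = \mu + \delta_{\theta_0}$. Then the rightmost expression in (B.3) can
be evaluated, and is $\bar f(\theta_0)$, so


$$\|\dot T_\mu(\delta)\| = \bar f(\theta_0) \ge (1 - \varepsilon)\sup\nolimits_\mu \bar f. \qquad \Box$$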


REMARKS. (1) Ordinarily, $\sup_\mu f = \sup f$, so the theorem determines $\|\dot T_\mu\|$. If,
e.g., $\mu\{\theta_0\} > 0$ and $f(\theta_0) > \sup\{f(\theta): \theta \ne \theta_0\}$, then $\sup f > \sup_\mu f$. In this case,
$\|\dot T_\mu\|$ is hard to compute. However, it can be shown that $\|\dot T_\mu\| = \sup\|\dot T_\mu(\delta)\|$ where
$\delta = \tfrac{1}{2}\delta_{\theta_1} - \tfrac{1}{2}\delta_{\theta_2}$ and $\theta_1 \ne \theta_2$ vary over $\Theta$.

(2) If $\delta$ is required to be absolutely continuous with respect to $\mu$, a similar
argument shows that $\|\dot T_\mu\| \le \mu$-ess. $\sup \bar f$, the inequality being sharp if e.g., $\mu$ is
continuous.

(3) We have chosen to differentiate in the set of signed measures. Our strong
derivative is called the “Fréchet derivative.” Another standard way of perturbing
$\mu$ is to consider the mixture $(1 - \varepsilon)\mu + \varepsilon\nu$ for some probability $\nu$ as $\varepsilon$ tends to
zero: the “Gâteaux derivative.” The mixture can be written as $\mu + \varepsilon(\nu - \mu)$ and
the notions of derivative coincide for bounded likelihood functions.

(4) Similar computations can be carried out for Bayes rules. If $\theta$ is a real
parameter, the mapping $M$

$$\mu \to \frac{\int \theta f(x|\theta)\,\mu(d\theta)}{\int f(x|\theta)\,\mu(d\theta)}$$

has derivative

$$\dot M_\mu(\cdot) = \frac{N_1(\cdot)}{D(\mu)} - \frac{N_1(\mu)}{D(\mu)^2}\,D(\cdot),$$

where $N_1(\delta) = \int \theta f(x|\theta)\,\delta(d\theta)$ and $D(\delta) = \int f(x|\theta)\,\delta(d\theta)$ as before. The norm of
$\dot M_\mu$ is computed as follows. Let $c = N_1(\mu)/D(\mu)$. This is the Bayes rule based on
$\mu$. Let $g(\theta) = (\theta - c)f(x|\theta)$. Define range $g = \sup g - \inf g$. Then
$\|\dot M_\mu\| = \tfrac{1}{2}$ range $g$.

(5) Theorem 4 assumes a dominated family. In the undominated case, the
derivative need not exist in our strong sense. The difficulties can already be seen
in the following simple example: take $\mathcal{X}$ and $\Theta$ to be the real line mod 10. Let
$Q_\theta\{x\} = \tfrac{1}{2}$ if $x = \theta \pm 1$; suppose the prior $\mu$ for $\theta$ has continuous density $f$ on $\Theta$.
The posterior for $\theta$ given $x$ is supported at $x \pm 1$ with mass


$$\text{(B.4)}\qquad T(\mu)\{\theta\} = \begin{cases} \dfrac{f(x-1)}{f(x-1) + f(x+1)} & \text{if } \theta = x - 1, \\[2ex] \dfrac{f(x+1)}{f(x-1) + f(x+1)} & \text{if } \theta = x + 1. \end{cases}$$


The map $T$ is norm continuous at no $x$. To see this, consider a sequence of
continuous prior densities $f_n$ converging to $f$ in variation distance but pointwise
at no point. More specifically, let $s_n = 1 + \tfrac{1}{2} + \cdots + 1/n$. Let $g_n$ on the line
vanish to the left of $s_n - (2/n)$; increase linearly to the value 1 at $s_n$; decrease
linearly to zero at $s_n + (2/n)$; and vanish to the right of that value. Wrap $g_n$
around the line mod 10; let $f_n$ be the sum of $f$ and the wrapped $g_n$, normalized
to be a density. Clearly $f_n \to f$ in $L_1$, but $f_n(x) \to f(x)$ for no $x$.

The argument is only sketched. Fix a real number $\theta$ with $0 \le \theta < 10$; for any
integer $k$, the real number $10k + \theta$ wraps to the same point $\theta$ in $\Theta$. For infinitely
many $n$, for some $k = k_n$, we have $s_n \le 10k + \theta < s_{n+1}$. Then $g_n(10k + \theta) \ge \tfrac{1}{2}$,
because $s_{n+1} - s_n < 1/n$. For such $n$, we have


$$f_n(\theta) \ge \left[f(\theta) + \tfrac{1}{2}\right]\frac{n}{n+2}$$


because $\int g_n = 2/n$. Since $g_n$ has only one bump, this can be either at $x + 1$ or
$x - 1$, but not both. Thus, for any $x$, for infinitely many $n$, we have both the
following relations

$$f_n(x + 1) \ge \left[f(x + 1) + \tfrac{1}{2}\right]\frac{n}{n+2},$$

$$f_n(x - 1) = f(x - 1)\,\frac{n}{n+2}.$$


So $f_n$ does not converge pointwise, and from (B.4) the map $T$ is not continuous.

In this example, the posterior is Gâteaux differentiable: a derivative can be
calculated by considering $(1 - \varepsilon)f + \varepsilon g$ as $\varepsilon$ tends to zero. The Gâteaux derivative
exists quite generally, as we will show elsewhere.

(6) In the situation of Theorem 4, the same result holds if the weak-star
topology is used instead of the norm topology, provided $f$ is bounded continuous
in $\theta$.

(7) A related computation is contained in Huber’s (1973) discussion of Bayes- 
ian robustness. 


1. General remarks. Diaconis and Freedman have demonstrated some
advantages and pitfalls of Bayesian inference. In summary, their results include the
inconsistency of location estimates based on a Dirichlet prior; the equivalence of
weak consistency and weak merging of posteriors; and an analysis of the
sensitivity of the posterior to changes in the prior. In this discussion, we provide
additional insight and point toward new developments. It is argued that the
Dirichlet is a poor choice of prior because the Dirichlet mixture has a likelihood
which is exponentially smaller than every product likelihood. We give conditions
on the choice of the prior such that the mixture likelihood is close to virtually
any product likelihood. Implications for Bayes consistency are discussed.

Consider the general problem of Bayes estimation on the line. We denote the
distributions on the line by $F$, the product distributions on $R^n$ by $F^n$, and the
prior on distributions by $\mu$. The mixture of product distributions is $G^{(n)} =$
$\int F^n\,\mu(dF)$. (According to a Bayesian with prior $\mu$, this $G^{(n)}$ is the distribution of
the data.) Following the “what-if” principle of Diaconis and Freedman, we
assume that $X_1, X_2, \ldots$ are independently drawn from a distribution $F_0$ with a
probability density function $f_0(x)$. The joint density $f_0(X_1, X_2, \ldots, X_n) =$
$\prod_{i=1}^n f_0(X_i)$ (evaluated at the data) is called the product likelihood. Let
$g(X_1, X_2, \ldots, X_n)$ be the density of the absolutely continuous component of $G^{(n)}$.
Ideally, we want the mixture likelihood $g(X_1, X_2, \ldots, X_n)$ to be close to the
product likelihood $f_0(X_1, X_2, \ldots, X_n)$ with high $F_0^n$ probability, for virtually
any such distribution $F_0$.


2. The location problem. Diaconis and Freedman consider the Bayes
estimation of a location parameter $\theta$ from data $X_i = \theta + \varepsilon_i$ where the $\varepsilon_i$ are
independently drawn from a distribution $F$. If the prior $\mu$ on distributions $F$ is
taken to be independent of $\theta$, then the posterior distribution of $\theta$ given
$X_1, X_2, \ldots, X_n$ depends on the prior $\mu$ only through the mixture distribution
$G^{(n)} = \int F^n\,\mu(dF)$. For instance, if $G^{(n)}$ is absolutely continuous on a set $A$ which
is invariant under translations of each coordinate by $\theta$, then the mean of the
posterior is given by

$$\hat\theta = \frac{\int \theta\,p(\theta)\,g(X_1 - \theta, \ldots, X_n - \theta)\,d\theta}{\int p(\theta)\,g(X_1 - \theta, \ldots, X_n - \theta)\,d\theta}$$

for $X_1, \ldots, X_n$ in $A$, where $p(\theta)$ is the prior density of locations.
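
The posterior mean above can be approximated on a grid of candidate locations. The sketch below is an illustration under assumptions that are not in the text: $g$ is taken to have product form, so that $\log g(x - \theta)$ is a sum of $\log f(x_i - \theta)$, the integrals are replaced by sums over an equally spaced grid, and the function names, data, and grid are hypothetical.

```python
import numpy as np

def posterior_mean_location(x, log_f, log_prior, grid):
    """Grid approximation to the posterior mean of a location parameter.

    Assumes a product-form likelihood: log g(x - theta) = sum_i log_f(x_i - theta).
    """
    resid = x[None, :] - grid[:, None]        # resid[j, i] = x_i - theta_j
    log_post = log_prior(grid) + log_f(resid).sum(axis=1)
    w = np.exp(log_post - log_post.max())     # unnormalized posterior weights
    return float(np.sum(grid * w) / np.sum(w))

x = np.array([0.2, 1.1, 0.5, 1.0])            # hypothetical data
grid = np.linspace(-10.0, 10.0, 20001)

def normal_log_f(e):
    """Standard normal log density, up to an additive constant."""
    return -0.5 * e ** 2

def flat_log_prior(t):
    """Prior density taken constant over the grid."""
    return np.zeros_like(t)

theta_hat = posterior_mean_location(x, normal_log_f, flat_log_prior, grid)
```

With a normal error density and a prior that is constant over the grid, `theta_hat` agrees with the sample mean of `x` up to discretization error; swapping in another `log_f` shows how the estimate depends on the assumed error density.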

Is the mean of the posterior consistent? As a degenerate example, suppose the
prior $\mu$ is point mass at a distribution $F$ with density $f$. If this prior guess is
exactly right, $F = F_0$, then the posterior mean is consistent for almost every $\theta$
(and consistent for every $\theta$ if the density $f_0$ is smooth; see Schwartz, 1965).
Whereas if the prior guess is wrong, $F \ne F_0$, then Diaconis and Freedman (1986)
show that the posterior mean may be inconsistent. (Surprisingly, $F \ne F_0$ does
not necessarily imply inconsistent posterior means. Diaconis and Freedman state
that the location estimate is consistent for essentially any $F_0$ if $F$ has a
log-concave density.) A naive reaction to this degenerate case is, “Why consider
point mass at a single distribution $F$, when there are priors like the Dirichlet for
which all distributions are in the (weak-star) support set?”

If $\mu$ is a Dirichlet prior with absolutely continuous base measure $\alpha$ (having
standardized density $f = \alpha'/\|\alpha\|$), then the Korwar-Hollander result (which
Diaconis and Freedman (1986) use) establishes that the distribution $G^{(n)}$ is
absolutely continuous on the set of sequences with distinct $X_i$, and the likelihood
$g(X_1, X_2, \ldots, X_n)$ is proportional to $\prod_{i=1}^n f(X_i)$. (The set with distinct $X_i$ has
probability 1 with respect to any continuous distribution $F_0^n$, but exponentially
small probability with respect to $G^{(n)}$.) Consequently, for the location problem,
the Dirichlet prior yields exactly the same estimator as the degenerate prior
which places point mass at a single $F$! Because of this degeneracy, the Dirichlet
prior is useless machinery for the location problem. We would prefer to use a
prior which (via the mixture) can simultaneously mimic a larger class of iid
distributions.


3. Matching likelihoods. We need a useful definition of closeness of
likelihoods. Since joint densities tend to grow (or shrink) exponentially (Barron,
1985a), the following definition is suggested as a natural property. A sequence of
likelihoods $g(X_1, X_2, \ldots, X_n)$ is said to match likelihoods $f_0(X_1, X_2, \ldots, X_n)$ if
for any $\varepsilon > 0$,


$$e^{-n\varepsilon}\,g(X_1, X_2, \ldots, X_n) \le f_0(X_1, X_2, \ldots, X_n) \le e^{n\varepsilon}\,g(X_1, X_2, \ldots, X_n)$$


for all $n$ sufficiently large, with $F_0$ probability 1. (Equivalently, $1/n$ times the
log-likelihood ratio tends to zero.) Matching may be thought of as a strong
merging property of mixtures. The first inequality in the definition holds without
conditions (by application of Markov’s inequality and the Borel-Cantelli lemma).
The second inequality holds only for well designed mixtures.
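
The matching property can be illustrated in a setting much simpler than distributions on the line. The sketch below uses coin tossing as a stand-in, which is an assumption of the illustration: the product likelihood is Bernoulli with a hypothetical true success probability, one mixture uses a uniform prior on the success probability, and the other is a point mass at a wrong value. Instead of random data it uses a typical sequence with `round(n * p_true)` successes, and all function names are hypothetical.

```python
import math

def log_product_likelihood(k, n, p):
    """log of p^k (1 - p)^(n - k)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_uniform_mixture(k, n):
    """log of the integral of t^k (1-t)^(n-k) over [0, 1] = k!(n-k)!/(n+1)!."""
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)

def log_point_mass(q):
    """Mixture likelihood for a prior that is point mass at q."""
    return lambda k, n: log_product_likelihood(k, n, q)

def matching_rate(n, p_true, log_g):
    """(1/n) log [f(X) / g(X)] on a typical sequence with round(n p) successes."""
    k = round(n * p_true)
    return (log_product_likelihood(k, n, p_true) - log_g(k, n)) / n

uniform_rates = [matching_rate(n, 0.3, log_uniform_mixture) for n in (100, 100000)]
point_rate = matching_rate(100000, 0.3, log_point_mass(0.5))
```

For the uniform mixture the rate decays roughly like $(\log n)/n$, so the likelihoods match in the sense above; for the point mass at the wrong value it stays near the relative entropy between the two coins, so matching fails.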

What conditions on the prior $\mu$ are sufficient for matching? Let $\|F_0 - F\|$
denote the total variation distance and let $D(F_0\|F) = E\log dF_0/dF$ denote the
relative entropy (Kullback-Leibler divergence). Either of the following
conditions is sufficient for the mixture likelihoods $g(X_1, X_2, \ldots, X_n)$ to match the
product likelihoods $f_0(X_1, X_2, \ldots, X_n)$:


(a) The prior assigns strictly positive mass to the relative entropy sets:
$$\mu\{F: D(F_0\|F) < \varepsilon\} > 0, \quad\text{for all } \varepsilon > 0,$$
or

(b) The prior assigns non-negligible mass to the variation distance neighborhoods
in the sense that there exists $\varepsilon_n$ with $\sum \varepsilon_n < \infty$ such that


u{F: nF, — Fil <e,} > eo. 


Moreover, (a) and (b) each imply a local matching property. Let
$g(X_1, X_2, \ldots, X_n|N)$ be the density of the absolutely continuous component of
the conditional distribution $G^{(n)}(\cdot|N) = \int_N F^n(\cdot)\,\mu(dF)/\mu(N)$. The prior is said
to locally match point mass at $F_0$ (weakly/strongly) if for all (weak-star/variation
distance) neighborhoods $N$ the likelihoods $g(X_1, X_2, \ldots, X_n|N)$ match
$f_0(X_1, X_2, \ldots, X_n)$. Note that local matching implies matching. The proofs that
(a) and (b) each imply strong local matching are implicit in Schwartz (1965).


4. Bayes consistency. What are the implications of matching for consistency?
Local matching implies weak consistency of the posterior, but it also
implies more. Let’s define weak, strong, and intermediate forms of consistency.
Let $\|F_0 - F\|_\pi = \sum_{A \in \pi}|F_0(A) - F(A)|$ be the variation distance on a partition
$\pi$ of the line. The total variation distance is $\|F_0 - F\| = \sup_\pi \|F_0 - F\|_\pi$. Sets of
the form $N_\pi = \{F: \|F_0 - F\|_\pi < \epsilon\}$ and $N = \{F: \|F_0 - F\| < \epsilon\}$ are, respectively,
weak-star and total variation neighborhoods of $F_0$. A sequence of posteriors
$\mu_n = \mu(\cdot|X_1, X_2, \ldots, X_n)$ is strongly consistent for $F_0$ if the posterior mass of
variation distance neighborhoods tends to one, $\mu_n(N) \to 1$, $F_0$-almost surely, for
any $\epsilon > 0$, and weakly consistent if $\mu_n(N_\pi) \to 1$ a.s. for any finite partition $\pi$
consisting of sets with boundary measure zero. Now for the intermediate definition:
A sequence of posteriors $\mu_n$ is said to be consistent for $F_0$ in $w_n$-variation if
$\mu_n(N_{\pi_n}) \to 1$ a.s., where $\pi_n$ is a countable partition of the line into intervals of
width $w_n$. We require that the widths $w_n$ tend to zero.
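
To make the partition-based distance concrete, here is a minimal Python sketch. It is an illustration only: the two normal distributions, the truncation limit, and the function name `partition_distance` are assumptions introduced here and do not come from the discussion. It computes $\sum_{A \in \pi}|F_0(A) - F(A)|$ for nested partitions into intervals of shrinking width and compares the values with the total variation distance (in the sum-over-cells convention used above).

```python
from statistics import NormalDist

def partition_distance(f0, f, width, limit=12.0):
    """Sum of |F0(A) - F(A)| over cells A of the given width on [-limit, limit),
    plus the two tail cells, so that the cells partition the whole line."""
    cells = int(round(2 * limit / width))
    # Tail cells (-inf, -limit) and [limit, inf).
    total = abs(f0.cdf(-limit) - f.cdf(-limit))
    total += abs((1 - f0.cdf(limit)) - (1 - f.cdf(limit)))
    for k in range(cells):
        a = -limit + k * width
        b = a + width
        total += abs((f0.cdf(b) - f0.cdf(a)) - (f.cdf(b) - f.cdf(a)))
    return total

# Hypothetical example: F0 = N(0, 1) and F = N(0.5, 1).
f0 = NormalDist(0.0, 1.0)
f = NormalDist(0.5, 1.0)

# Closed form for two unit-variance normals whose means differ by 0.5,
# in the sum-over-cells convention (the integral of |f0 - f|).
total_variation = 2 * (2 * NormalDist().cdf(0.25) - 1)

# Nested partitions: each refinement can only increase the distance.
distances = {w: partition_distance(f0, f, w) for w in (1.0, 0.5, 0.25)}
for w, d in distances.items():
    print(f"width {w}: {d:.4f} (total variation {total_variation:.4f})")
```

Because the partitions are nested, the distance is nondecreasing as the width shrinks and never exceeds the total variation distance, which is the sense in which consistency in $w_n$-variation sits between the weak and strong forms.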

Why should we care about intermediate consistency? It is shown in Barron
(1985b, 1986) that for any prior $\mu$, local matching implies weak consistency
and consistency in $w_n$-variation if $\lim n w_n > 0$ (e.g., $w_n = 1/n$). Conversely, if
$\lim n w_n = 0$ then there exists a prior $\mu$ which locally matches $F_0$ (and even
satisfies property (a)), but the posterior is inconsistent in $w_n$-variation,
$\lim \mu_n(N_{\pi_n}) = 0$. Thus $w_n$-consistency with $w_n$ proportional to $1/n$ is the strongest
possible consistency obtainable from the sole assumption of local matching. A
consequence of this result is that Bayes estimates of the distribution need only be
smoothed over intervals of width $1/n$ to obtain strongly consistent density
estimates, whereas for ordinary histograms and kernel density estimates the
smoothing must extend over widths $w_n$ satisfying $n w_n \to \infty$. The proof of
$n^{-1}$-consistency is based on finding a sequence of tests of the hypothesis $F = F_0$
versus the composite hypothesis $F \notin N_{\pi_n}$ such that the probability of error is
uniformly exponentially small over all $F \notin N_{\pi_n}$. Such a test is relevant because
Schwartz (1965) shows that local matching plus the existence of uniformly
consistent tests implies consistency. The test statistic is essentially a weighted
count of the number of empty cells, £4 -.(e7**" 4 empty) 7 1) 

A natural class of priors on densities is obtained by convolving the distribu- 
tions drawn from the Dirichlet with a kernel of random width. These priors have 
been examined by Lo (1984). Here are some open questions. Are these priors 
strongly consistent? Do they match a large class of distributions F? 

Another source of strongly consistent priors is the class of priors that assign mass
to a countable set of distributions. For instance, the prior might assign mass to
every histogram on dyadic intervals with rational heights. Or the prior might
assign mass to every computable distribution. In either case the prior satisfies the
relative entropy condition and hence it is strongly local matching for any $F_0$ with
bounded density on compact support. If a prior is root summable, $\sum_F \mu(F)^\alpha < \infty$
for some $0 < \alpha < 1$, and strongly local matching at $F_0$, then the posterior is
strongly consistent for $F_0$ [see Barron (1985b, 1986)].


5. On mixtures and consistency. A useful device for incorporating a 
variety of possible prior beliefs is to take a countable mixture of priors. For- 
tunately, the local matching property and hence the consistency is preserved by 
countable mixtures. If at least one of the priors locally matches $F_0$, then so does
the mixture of the priors. 

In an earlier paper, Freedman and Diaconis (1983) showed that mixtures
involving Dirichlet priors may be inconsistent. In particular, for probability mass
functions on the positive integers, they considered the mixture of a Dirichlet
prior (with uniform “stick-breaking”) and a point mass at a probability mass
function $\phi(i)$ proportional to $1/(i(\log i)^2)$. The true probability mass function is
taken to be $\theta^*$, which differs from $\phi$ for small $i$ and is equal to $\phi$ for all large $i$.
The posterior has the unfortunate property of concentrating at $\phi$ rather than in
neighborhoods of $\theta^*$. From this inconsistency, we conclude that the Dirichlet
prior does not locally match $\theta^*$. Moreover, the Dirichlet prior assigns zero
mass to the relative entropy neighborhood $\{\theta: \sum_i \theta^*(i)\log(\theta^*(i)/\theta(i)) < \epsilon\}$ for $\epsilon$
sufficiently small.

Freedman and Diaconis have pointed out that $\phi$ and $\theta^*$ have infinite entropy
$H(\theta^*) = \sum_i \theta^*(i)\log(1/\theta^*(i))$. One might think that the inconsistency is a result of
the infinite entropy; however, even if certain finite entropy mass functions are
used in the construction, inconsistency will still result. It is enough that $\theta^*$ and $\phi$
have tails proportional to $1/i^\alpha$ where $1 < \alpha < 4$. (The verification of inconsistency
closely parallels Sections 2 and 3 of Freedman and Diaconis, 1983.) In
Freedman (1963), finite entropy appears as part of a condition for consistency. 
We now know that the finite entropy assumption is extraneous. It is the relative 
entropy that matters for Bayes consistency. 

In summary we have discussed some inadequacies of the Dirichlet prior as 
revealed by the analysis of Diaconis and Freedman and we have pointed toward 
stronger consistency and merging results obtainable for other priors. 
The very lucid paper of Diaconis and Freedman is full of stimulating ideas and
discussion. The ideas fall roughly into three categories: (i) inconsistency of Bayes
rules, (ii) frequentist-Bayesian interrelationships including the “what if” method,
and (iii) new Bayesian devices and techniques. My comments will be grouped by
these categories, and will be restricted (because of space considerations) solely to
a Bayesian view of the situation.




1. Inconsistency of Bayes rules. The fact that parametric Bayesian anal- 
ysis virtually always yields consistent estimators (Bayesian “stable estimation”) 
may have, at one time, lulled Bayesians into believing that consistency was not a 
concern. Freedman (1963, 1965) disabused Bayesians of this notion (or at least 
should have), and this and the following paper provide convincing further 
evidence that Bayesians should be concerned with consistency in nonparametric 
and infinite parametric problems. 

It is important to emphasize the nature of the inconsistency that can arise in
these problems. Doob’s theorem (see Corollary A2 in the appendix of the paper)
shows that the posterior mean is consistent for $\theta$ in a set, $\Theta_0$, of prior probability
1. The Freedman and Freedman-Diaconis results show, however, that there are $\theta$
very close to $\Theta_0$ (indeed limit points of $\Theta_0$) for which the posterior mean is not
consistent. Thus the Dirichlet process prior concentrates on $\Theta_0$ = {discrete distributions},
and for $\theta \in \Theta_0$ consistency problems are not to be expected (see also
the commentary of H. Doss), but for $\theta \notin \Theta_0$ Diaconis and Freedman have
constructed problems where inconsistency can result.

I feel this shows that a Bayesian has to be especially careful in constructing 
the prior for nonparametric or infinite parametric situations; in particular use of 
“convenient” priors may be more dangerous in nonparametric than in parametric 
Bayesian analysis. One could avoid the inconsistency problem by constructing 
the prior, $\mu$, to concentrate on those $\theta$ deemed reasonable a priori (rather than
settling for a convenient “dense” $\Theta_0$, as the Dirichlet process prior does), or one
could explicitly worry about consistency of the selected prior, but work is 
involved in either approach. In this respect it should be realized that the 
“tail-free” priors and “neutral-to-the-right” priors are positive steps in the latter 
direction; they are priors for which consistency has been verified in nonmixed 
problems. (Also, Freedman (1963) showed how such priors could be modified to 
incorporate subjective information.) Very little Bayesian research has been done 
on the alternate approach of developing priors which “live” on the right spaces, 
partly because of the calculational allure of priors such as the Dirichlet process 
priors; hopefully such development will now be forthcoming. 

A general question I have for the authors is: How likely is it for one to 
encounter a consistency problem in practice? There are at least two reasons for 
asking this. The first is that many of the difficulties here seem to be caused by 
the concentration of the Dirichlet process prior on the set of discrete probability 
measures. From the beginning, many Bayesians (though certainly not all) have 
been very leery of the Dirichlet process (when used as a prior for all or continuous 
densities) because of this unnatural concentration. The weird occurrences in these 
two papers (and also the commentary of H. Doss), reinforce the notion that it is 
the Dirichlet process prior which is the main problem. There is, of course, the 
Freedman (1965) result, which shows that consistency will only tend to occur on a 
first category set, but the implication of this is unclear since (for instance) there 
exist first category sets on the real line whose complements have Lebesgue 
measure zero. Being consistent, except on a set of Lebesgue measure zero, would 
be quite satisfactory to many. 
The second reason to wonder about the practical importance of these inconsistency
results in nonparametric settings is that, often, the function or distribution
being estimated is “nicer” than a typical element of the nonparametric class
being considered; a too large nonparametric class is often assumed for mathematical
convenience. Consistency may obtain at the realistic “nice” functions, as
it does at the “nice” strongly unimodal densities, h, in the major location
example of the paper.

I am not really trying to argue that the examples in these papers are artificial; 
indeed, one of the major strengths of the papers is that they exhibit inconsistency 
in relatively natural problems. Nevertheless, if the authors have developed any 
feel for the chance of encountering inconsistency in practice, it would be nice to 
hear. 

Also, in this regard, the authors refer to analyses by Jeffreys, Fraser, Box and 
Tiao, and Johns, at least some of which are entirely parametric. It is unclear from 
the paper whether it is being claimed that these particular analyses can actually 
be inconsistent, or whether it is merely the case that related analyses, using (say) 
Dirichlet process priors, can be inconsistent. In the first case, there is obviously 
“evidence” that Bayesians are likely to encounter consistency difficulties. 

Before leaving this subject, I feel compelled to also mention the other side of 
the coin. Although I do not think it was the intention of the paper to make any 
“Bayesian versus frequentist” value judgements, some may interpret the paper as 
an argument against Bayesian analysis. Such an interpretation must be tempered 
by the realization that consistency can also be a problem for frequentist procedures.
Even more to the point from a practical perspective, the advantage in
most finite sample situations of Bayesian analysis, as opposed to frequentist 
“large sample theory,” is often not appreciated. There is a massive frequentist 
industry which derives large sample asymptotic results, and then “hopes” that 
the results work okay for finite samples. What is not commonly appreciated is 
that Bayes procedures will typically have the same large sample behavior, and 
yet are also probably reasonable for small samples. If one has a variety of 
“equivalent” large sample procedures, why not use one which is also constructed 
to be good for small samples, instead of simply choosing one “at random”? 

Another aspect of this “other side of the coin” is that it is precisely in high
dimensional parametric and even nonparametric problems that it can be most 
crucial to utilize subjective prior information. It will be rare to have enough data 
to illuminate all dark corners of a high dimensional problem, and subjective input 
(including model development) is often unavoidable. As one frequentist-type 
example, consider Stein estimation in nonsymmetric multivariate settings. It is 
fairly well established (cf. Berger and Berliner, 1984) that one cannot avoid 
subjectively determining where and how one should “shrink” the least-squares 
estimator. And in nonparametrics there are often compelling reasons to attempt 
to subjectively specify the rough shape or at least the smoothness of the function 
or distribution to be estimated. Thus, while Bayesians may encounter unexpected 
difficulties (such as consistency) in these problems, the need and incentive for 
Bayesian input is greatly enlarged. 
These last comments were not meant to prove anything. The point was merely
to emphasize that frequentist analysis is by no means clearly superior to Bayesian
analysis when considering the broad area of utilization of large sample
theory.


2. Frequentist-Bayesian interrelationships. There are a large number of 
coincidental and technical relationships between frequentist and Bayesian analy- 
sis, many of which are mentioned in the paper and discussed extensively in the 
references therein. Though interesting, these relationships are not as important 
as the operational issue of when a Bayesian should make use of frequentist ideas. 
The related issue, of when a frequentist must make use of Bayesian methods, is a 
much lengthier topic, and will not be discussed here. (Some examples and 
references to this issue are given in the paper: others can be found in Berger and 
Wolpert (1984), Berger and Sellke (1984), and Berger (1985).) 

The italicized words should and must, in the above paragraph, reflect my 
beliefs that a Bayesian can sometimes utilize frequentist ideas to make life easier, 
whereas a frequentist is often forced by reality to completely abandon ship. No 
attempt will be made to support the latter part of this statement, but I will 
digress to discuss the robust Bayesian motivation for the first part of the 
statement. This digression is somewhat out of place here, but my subsequent 
comments on the Bayesian uses of frequentist measures that are proposed in the 
paper would be otherwise unintelligible. 

The robust Bayesian position can be roughly stated as follows: An answer to 
a statistical problem is a good answer only if there is substantial reason to 
believe that the answer would approximately equal the posterior Bayes answer 
for any reasonable sampling model and prior distribution (and loss function in a 
decision problem) entertained. Thus, suppose it is roughly felt that $X$ is $N(\theta, 1)$,
that $\theta$ is $N(0, 1)$, and that the loss in estimating $\theta$ is increasing in $|\theta - a|$. Then
the Bayes estimate is $a^* = \frac{1}{2}x$. If $x = 1$ is observed, it can be seen that $a^* = \frac{1}{2}$ is
a good answer, in that small reasonable variations in the model, prior, and loss do
not change the Bayes estimate much. For $x = 5$, however, the situation is very
different. Changing either of the distributions to, say, a similar Cauchy distribution
will radically alter the Bayes estimate, so $a^* = 2.5$ is not necessarily a good
estimate. No effort will be made to defend this robust Bayesian belief here; see
Good (1983), Berger (1984), and Berger (1985) for such a defence.
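
The sensitivity described in this example can be checked numerically. The following Python sketch is illustrative only: it assumes squared-error loss (so that the Bayes estimate is the posterior mean), swaps only the prior, and uses a standard Cauchy prior as a stand-in for the "similar Cauchy distribution"; the function names and grid settings are hypothetical choices made here.

```python
import math

def posterior_mean(x, prior_density, lo=-30.0, hi=30.0, steps=60001):
    """Posterior mean of theta given X = x when X | theta ~ N(theta, 1).
    Uses a Riemann sum on a uniform grid; the grid spacing and any
    normalizing constants cancel in the ratio."""
    num = 0.0
    den = 0.0
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        weight = math.exp(-0.5 * (x - theta) ** 2) * prior_density(theta)
        num += theta * weight
        den += weight
    return num / den

def normal_prior(theta):
    # Unnormalized N(0, 1) density.
    return math.exp(-0.5 * theta ** 2)

def cauchy_prior(theta):
    # Unnormalized standard Cauchy density (an assumed stand-in).
    return 1.0 / (1.0 + theta ** 2)

for x in (1.0, 5.0):
    print(x, posterior_mean(x, normal_prior), posterior_mean(x, cauchy_prior))
```

Under these assumptions the two priors give nearly the same estimate at $x = 1$, while at $x = 5$ the Cauchy prior moves the estimate from 2.5 to well above 4, in line with the point that $a^* = 2.5$ is not robust.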

The most natural way to investigate Bayesian robustness is through what 
Leamer (1978) calls global sensitivity analysis: vary the model, prior, and loss 
over reasonable ranges and see what happens to the posterior Bayesian answer. 
(Recent works in this direction, which contain many other references, are Berger 
and Berliner (1983) and Berger (1985).) 

The point of this aside is that the robust Bayesian definition of a good answer
does not involve frequentist ideas in any way. The data, $x$, is always treated as
known; the “variables” in the analysis are the model relating $x$ and the unknown
$\theta$ of interest, the prior for $\theta$, and any loss to be considered. (Actually, Bayesians
see little conceptual difference between models and priors. Also, we are considering
here only the final “inference about $\theta$” stage of the analysis. In topics such as
experimental design, the data are not yet known and at least partly frequentist
measures become necessary.) If robust Bayesians can be satisfied that global
sensitivity obtains, they will look no further.

From this viewpoint, frequentism may come into play only when global 
sensitivity is unattainable (due, say, to an inability to sufficiently refine the 
usually subjective inputs of model, prior, and loss), or is unverifiable (due to 
technical limitations in carrying out the global sensitivity study). In nonparamet- 
ric or even high-dimensional parametric problems, both difficulties are present 
with a vengeance. It can be hard to perform any sensible Bayesian analysis, 
much less carry out an extensive sensitivity study. There are then various roles 
that frequentist ideas can play. 

The role that is concentrated on in this paper is the “negative” one that bad
frequency performance is often (but not always) an indicator of a definite lack of
Bayesian robustness. A lack of consistency, for seriously entertained $\theta$, would be
perhaps the most drastic indication of such a lack of robustness. A number of
other such frequentist indicators are discussed in Berger (1985).

Another frequentist-based tool that is discussed in the paper, as being of
possible interest to Bayesians, is the “what if” method. Note, first of all, that the
robust Bayesian viewpoint could be called a “what if” approach; what if the
model, prior, and loss were changed in reasonable ways? The “what if” method
discussed in the paper is quite different, however; it asks “What if we had
observed different data?” The relevance of this to a robust Bayesian is not clear,
since the robust Bayesian cares about sensitivity to assumptions only for the
observed data. Thus, in our earlier simple example, the robust Bayesian can feel
reasonably satisfied with his answer of $a^* = \frac{1}{2}$ when $x = 1$ is observed, and will
not care that he might have been unhappy with his model or prior had he
happened to observe $x = 5$. The changes that would be entertained in the model
or prior, upon observing $x = 5$, will have little effect for $x = 1$. The general
principle is that the Bayesian will not try to protect against features of the model
or prior that are irrelevant for the data at hand. Note that this is part of the
fundamental distinction between conditional and unconditional statistical analysis,
a distinction which, to many, is much more crucial than use of a prior
distribution.

We do not here defend the robust Bayesian version of “what if” as opposed to
the frequentist version discussed in the paper; the goal has been simply to 
indicate that there is a crucial difference. Also, we would not state that the 
frequentist “what if” method is without value; it is just not clear when it can 
provide insight not available by a prior sensitivity study. In the example of 
inconsistency in estimating the location parameter, I would guess that, for a 
given large sample, the conclusion would be quite sensitive to the choice of the 
prior, so that a prior sensitivity study would reveal the problem. Even if this 
were not the case, one would probably not have to leave the given data to see a 
potential problem; calculating the posterior for subsets of the data would 
presumably reveal the oscillatory behavior of the posterior. (Of course, looking at 
such subsets is somewhat “frequentist what-iffish” in nature.) 
The two situations in which it is clear that frequentist measures can be useful
to a Bayesian are (i) when Bayesian calculations are very difficult compared to
frequentist calculations, and (ii) when developing “automated” Bayesian procedures
for use in (say) computer packages. Even then, interest (to a Bayesian) in a
frequentist measure occurs primarily when it can be interpreted in a Bayesian
fashion; the following is a standard example.


EXAMPLE. Let $X$ denote the random observation in an experiment with
unknown $\theta$, and suppose that $C(x)$ (a subset of $\Theta$ for each $x$) is a $100(1 - \alpha)\%$
confidence procedure; thus, for all $\theta$,


(1) $P_\theta(C(X) \text{ contains } \theta) = 1 - \alpha$.
If $\mu$ is a prior on $\Theta$, it follows that
(2) $E^\mu P_\theta(C(X) \text{ contains } \theta) = 1 - \alpha$.


Now, a Bayesian would be interested in the posterior probability that $\theta$ is in
$C(x)$ (for the observed $x$, of course); denote this by $\delta_\mu(x)$. But it is easy to see
that


(3) $E^m \delta_\mu(X) = E^\mu P_\theta(C(X) \text{ contains } \theta) = 1 - \alpha$,


where $m$ is the marginal distribution of $X$. But knowing that (3) holds, when $\alpha$ is
small, is useful information to a Bayesian who has difficulty in working with
$\delta_\mu(x)$ directly, in that it then seems very likely (with respect to $m$) that $\delta_\mu(x)$ is
near 1. And this holds for any $\mu$, so that Bayesian robustness seems likely to be
present. There is, of course, no guarantee that $\delta_\mu(x)$ is near 1 for the actual $x$
obtained, but there is certainly reason to be optimistic (when $\alpha$ is small).
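
Identity (3) can be verified numerically in a simple conjugate setting. The sketch below is only an illustration under assumptions introduced here: $X \mid \theta \sim N(\theta, 1)$, the prior is $N(0, \tau^2)$, and $C(x) = [x - z, x + z]$ is the usual two-sided interval. The function names are hypothetical.

```python
from statistics import NormalDist

def posterior_coverage(x, tau, z):
    """delta_mu(x): posterior probability that theta lies in [x - z, x + z]
    when X | theta ~ N(theta, 1) and theta ~ N(0, tau^2)."""
    shrink = tau ** 2 / (1.0 + tau ** 2)
    post = NormalDist(mu=shrink * x, sigma=shrink ** 0.5)
    return post.cdf(x + z) - post.cdf(x - z)

def average_coverage(tau, alpha=0.05, grid=4001, span=10.0):
    """E^m delta_mu(X): posterior coverage averaged over the marginal
    m = N(0, 1 + tau^2), by a Riemann sum over +/- span standard deviations."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    marginal = NormalDist(0.0, (1.0 + tau ** 2) ** 0.5)
    lo = -span * marginal.stdev
    step = 2 * span * marginal.stdev / (grid - 1)
    total = 0.0
    for i in range(grid):
        x = lo + i * step
        total += posterior_coverage(x, tau, z) * marginal.pdf(x) * step
    return total

z95 = NormalDist().inv_cdf(0.975)
for tau in (0.5, 1.0, 3.0):
    print(tau, average_coverage(tau))
print(posterior_coverage(0.0, 1.0, z95), posterior_coverage(5.0, 1.0, z95))
```

In this setting the average posterior coverage equals $1 - \alpha$ for every $\tau$, as (3) requires, while the conditional coverage $\delta_\mu(x)$ is above 0.99 at $x = 0$ and below 0.3 at $x = 5$ when $\tau = 1$. That spread is the "no guarantee for the actual $x$" caveat made above.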


For the two situations mentioned before the example, it is easy to see the value 
of (3). Examples exist (see Berger and Wolpert, 1984) where $\delta_\mu$ is very difficult to
calculate for any reasonable $\mu$, and yet it is almost trivial to verify (1) (and hence
(3)). Such examples are somewhat rare, but they do exist. And the attraction of 
(3) in “automated” statistics is that it can impart a feeling of Bayesian robust- 
ness without the need for a sophisticated sensitivity study (which users of 
automated procedures may not be able to perform). Note that (3) need not hold 
for all priors; it need hold only for the class of “reasonable” priors (cf. Morris, 
1983). 

The important distinction in the above use of frequentist measures, by a 
Bayesian, is that there is no desire to involve nonobserved x in the analysis. The 
frequentist measure merely provides a convenient route to a possibly useful 
Bayesian measure. 

As a final comment on “automated procedures,” the development of inherently 
robust Bayesian procedures is an important Bayesian research goal (see Berger, 
1985). One major thrust of the Diaconis—Freedman program can be interpreted in 
this light; namely, the development of priors that are guaranteed to be con- 
sistent. 
3. New Bayesian devices and techniques. Bayesians are always excited
by new Bayesian tools, and at least two are discussed in the papers of Diaconis
and Freedman. One is the technique by which difficult Bayesian calculations can
be performed by a limiting argument. While related calculations have been
carried out before, the very general discussion here (cf. Section 4 in the second
paper) should prove very useful to Bayesians.

The second new tool is the derivative given in Theorem 4. For parametric 
classes of priors, the study of the derivative of posterior features of interest, with 
respect to the parameters of the prior, has come to be called local sensitivity (cf. 
Leamer (1978) and Polasek (1984)) and can indicate features of the prior that are 
particularly influential and which, hence, may require more careful consideration. 
For the most part, previous work has concentrated on local sensitivity of the 
posterior mean and covariance matrix in conjugate prior situations; the nonpara- 
metric generalizations in this paper (see Appendix B for the relevant formula for 
the posterior mean) are exciting developments. In line with the previous discussion
on “robust Bayesianism,” I am most excited about the use of these
derivatives to indicate “directions” in which the answer is particularly sensitive
to the prior input. In terms of the Gâteaux derivative, this is somewhat more
intuitive; letting the prior be $(1 - \epsilon)\mu + \epsilon\nu$, sending $\epsilon$ to zero, and finding $\nu$ for
which the directional derivative is largest, may well indicate where additional
prior elicitation efforts or sensitivity studies should be concentrated.

An additional attractive feature of using the Gâteaux derivative is that it ties
in well with the most promising formal approach to global sensitivity, which is to
investigate the range of the Bayesian measure of interest as the prior ranges over
the “$\epsilon$-contamination” class $\{\mu = (1 - \epsilon)\mu_0 + \epsilon\nu\}$, where $\mu_0$ is an elicited prior, $\epsilon$
reflects the possible inaccuracy in this specification, and $\nu$ ranges over some class of
plausible contaminations. The attractiveness of global sensitivity studies for this
class is indicated by Huber (1973), Berger and Berliner (1983), and Berger (1985).
The tie-in with local sensitivity via the Gâteaux derivative might lead to a nice
unification of Bayesian sensitivity theory.
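
As a small illustration of a global sensitivity study over a contamination class, the following Python sketch computes the range of the posterior mean as the prior ranges over $(1 - \epsilon)N(0, 1) + \epsilon\,\delta_t$ with $X \mid \theta \sim N(\theta, 1)$. The choices here are assumptions made for illustration, not taken from the discussion: the contaminations are restricted to point masses $\delta_t$ on a grid, $\epsilon = 0.1$, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def contaminated_posterior_mean(x, eps, t):
    """Posterior mean of theta under the prior (1 - eps) N(0, 1) + eps * delta_t,
    with X | theta ~ N(theta, 1)."""
    w_base = (1 - eps) * normal_pdf(x, 0.0, 2.0)  # marginal of x under the N(0, 1) prior
    w_cont = eps * normal_pdf(x, t, 1.0)          # marginal of x under a point mass at t
    return (w_base * (x / 2) + w_cont * t) / (w_base + w_cont)

def posterior_mean_range(x, eps=0.1, grid=None):
    """Smallest and largest posterior mean over point-mass contaminations on a grid."""
    if grid is None:
        grid = [i / 100 for i in range(-1000, 1001)]
    means = [contaminated_posterior_mean(x, eps, t) for t in grid]
    return min(means), max(means)

for x in (1.0, 5.0):
    low, high = posterior_mean_range(x)
    print(f"x = {x}: posterior mean ranges over [{low:.3f}, {high:.3f}]")
```

With these assumed settings the range is narrow at $x = 1$ and several units wide at $x = 5$, which matches the earlier informal robustness judgment about the two observations.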
The two papers by Diaconis and Freedman which are under discussion contain
a series of interesting and nicely presented results. The philosophical issues which
they raise are thought-provoking and merit attention. Their papers also give a
useful review touching on a number of topics of interest to frequentists and
Bayesians.

For simplicity, in the ensuing comments I shall refer to Diaconis and Freed- 
man (1986a) as DFa and Diaconis and Freedman (1986b) as DFb. My comments 
touch on three topics: the technical aspects of DFa, the philosophical implica- 
tions of the results in DFb, and the extension of the “what if” method in DFb to 
Bayesian robustness. 

The model (1.1) of DFa and the accompanying priors seem innocuous, and it is
somewhat disconcerting that they can lead to inconsistency. Theorem 1 of DFa
says that the posterior for $\theta$ will fail to converge even though $h$ has a global
maximum at 0. Theorem 3 states that using a symmetrized prior might not help;
we can even get the posterior law of the data wrong. On the other hand, perhaps
the consoling message from DFa is that if $\log \alpha'$ is convex, then in the setting of
Theorem 1 the posterior for $\theta$ will converge. Less helpful is the fact that the
posterior will converge if the (unknowable) density $h$ is strongly unimodal.

The discretization results of Section 4 of DFa can be used to approximate the 
solutions to decision problems in the undominated case. In Clayton (1985), I used 
a form of discretization with a Dirichlet process prior to approximate the worth 
of optimal rules for a sequential problem. I conjectured in that paper that 
discretization could be used to construct nearly optimal rules. (The construction 
of optimal rules is practically impossible unless the Dirichlet parameter has a 
finite support.) It seems possible to use the results of Section 4 of DFa to prove 
that conjecture. 

How important is this issue of inconsistency to a Bayesian? I think Diaconis 
and Freedman are right in DFb to consider separately the classical and subjective 
Bayesians, even though many Bayesians have the characteristics of both groups. 
To a classical Bayesian, a consistent Bayes estimate means that the Bayesian
will eventually discover the “true” parameter value, and so the Bayesian and
frequentist will eventually agree. This seems consoling to those ill-at-ease with
the Bayes/non-Bayes controversy: we might use different methods, but we
eventually uncover the same truths.

Why should a subjectivist be concerned with inconsistency? Theorem 3 of DFb 
says that if a Bayesian “A” is consistent, then from a Bayesian “B’s” point of 
view A and B will eventually agree. There are situations, however, where this is 
unsuitable. For example, suppose A and B are witness to some coin tossing. 
Bayesian A is firmly committed to the belief that all coins are fair, and so uses a 
prior $\delta_{1/2}$ for $\theta$, the probability of heads. B is firmly committed to the belief that
coins are never fair, and uses a uniform prior on $[0, \frac{1}{4}] \cup [\frac{3}{4}, 1]$. Both A and B
will use Bayes theorem to coherently update their priors as they see data, but 
they will never agree, nor should they. 

Such failures of opinions to merge do arise, at least informally, in discussions 
about issues such as the afterlife, a flat earth, the Bermuda triangle, the cause of 
mass extinctions, and so on. 

In the spirit of the “what if” method, the subjectivist should be interested in 
the effect the prior has on inferences, and should therefore be interested in its 
possible inconsistency. However, while the classical Bayesian would probably 
never want to use an inconsistent prior, the subjective Bayesian might very well 
choose to use such a prior. 

I agree with Diaconis and Freedman that the “what if” method is a useful
technique in considering a prior; I suggest that it provides a method of assessing 
the robustness of the analysis to the choice of prior. As Berger has suggested 
(Berger, 1984), it is difficult to specify priors exactly and so we should be aware of 
how changes in the prior will affect our inferences. If a small change in the prior 
results in a small change in the posterior, then this indicates that an exact 
specification of the prior is not critical. On the other hand, if a small change in 
the prior results in a large change in the posterior, then the data have little to say 
relative to the information in the prior. Presumably in this situation one would 
want to be more careful about the choice of prior. 

While Berger (1984) tends to look at gross changes in the prior, I would put
the emphasis on examining “small” or “local” changes in the prior. The effects of
small changes of this sort are measured by the derivative of the posterior with
respect to the prior, or in the notation of DFb, $\dot T_\mu$. Often we will be less interested
in the entire posterior than in the corresponding Bayes rule, and in that case we will
prefer to look at $\dot M_\mu$. We can go further: In some settings a small change in the
prior might result in a big change in the Bayes rule $M_\mu$, but this might not be
important if the accompanying change in the Bayes risk, $R_\mu$, is small. This leads
us to look at the derivative of the Bayes risk with respect to the prior, $\dot R_\mu$. $\dot T_\mu$ and
$\dot R_\mu$ give “local” measures of Berger’s “posterior robustness” and “procedure
robustness,” respectively (Berger, 1984).

How well does this approach work? In the example in Section 4 of DFb the
interpretation of $\|\dot T_\mu\|$ is pleasing: The posterior of $\mu$ is most influenced when $x$ is
far from $\mu_0$, relative to $\sigma^2 + \sigma_0^2$. This suggests picking a prior for which $\sigma_0^2$ is
large. It is tempting to try to find “robust” priors for which $\|\dot T_\mu\|$ is small.
However, $\lim_{\sigma_0 \to \infty}\|\dot T_\mu\| = \infty$ for $\mu_0$, $x$, and $\sigma^2$ fixed, which leads to picking a
prior with small $\sigma_0$. If my calculations are correct, $\|\dot M_\mu\|$ behaves similarly in this
regard.

A possible complaint about using $\dot T_\mu$, $\dot M_\mu$, or $\dot R_\mu$ as measures of robustness is
that only the prior is being called into suspicion; the likelihood is assumed fixed. 
One approach to this problem is to use a large class, C, of distributions for the 
likelihood: Box and Tiao (1973) give an example where a class of exponential 
power distributions is used instead of a normal likelihood. There is another 
approach which leads to a very large class C. 

If we follow de Finetti (1975), then our efforts should concentrate on modeling 
observations, not parameters. Specifically, we should focus on P, the joint 
measure for the observable data $X_1, X_2, X_3, \ldots$. Suppose data $X_1, X_2, \ldots, X_n$
are collected and a Bayes rule $M_\mu$ is formed. I would argue that from the robust
Bayesian viewpoint it is appropriate to look at the derivative of $M_\mu$ with respect
to $P$. (Depending on how we define the neighborhoods of $P$, this could lead us
back to $\dot M_\mu$.) Generally, $dM_\mu/dP$ could be very difficult to compute. A compromise,
which is not likelihood dependent and corresponds to a very large C, uses
the Dirichlet process to describe the distribution $P$. The specific $P$ chosen is
determined by the Dirichlet parameter $\alpha$. In keeping with the “what if” approach,
we can ask how the Bayes rule is affected by a small change in $\alpha$. To be
more specific, let us suppose that X, = x,,..., X, =x, are observed and we 
want to predict X,,,, with squared-error loss. The resulting Bayes rule is 
p= M/(M + n)up + Ex,/(M + n) where M = a(R), FC) = o(-)/M, and pp = 
{x dF. We may compute the Gâteaux derivative, ji(a, 8), of u, with respect to a 
in the direction of the measure £. To keep the example simple, suppose A(R) = M, 
G(-) = B(-)/M, and tg = {x dG. Then (Serfling, 1980) 

1 


spay lim feala- r+ eG] — pp 





(a, B) = 





M 


This suggests that, “locally,” the only critical aspect in the mispecification of a is 
the mispecification of the mean of an observation. (a, 8) also has the appealing 
feature of being small when n is large. When G = 6,, f(a, 8) differs from the 
influence curve for the mean by a multiplicative constant. 
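
The derivative of the predictive mean can be checked numerically. The sketch
below is only an illustration: the function names, the data, and the prior
settings are assumptions chosen for the example, not values from the
discussion. It compares a difference quotient along the path
$(1 - \varepsilon)\alpha + \varepsilon\beta$ (both measures with total mass $M$)
with the closed form $M(\mu_G - \mu_F)/(M + n)$.

```python
def predictive_mean(total_mass, prior_mean, xs):
    """Bayes rule M/(M+n) * mu_F + sum(x_i)/(M+n) under a Dirichlet process prior."""
    n = len(xs)
    return (total_mass * prior_mean + sum(xs)) / (total_mass + n)


def gateaux_closed_form(total_mass, mu_f, mu_g, n):
    """Closed form M/(M+n) * (mu_G - mu_F) for the derivative in direction beta."""
    return total_mass / (total_mass + n) * (mu_g - mu_f)


def gateaux_finite_difference(total_mass, mu_f, mu_g, xs, eps=1e-6):
    """Difference quotient along alpha_eps = (1 - eps) * alpha + eps * beta.

    alpha and beta share the total mass M, so only the prior mean moves:
    mu_eps = (1 - eps) * mu_F + eps * mu_G.
    """
    mu_eps = (1.0 - eps) * mu_f + eps * mu_g
    base = predictive_mean(total_mass, mu_f, xs)
    moved = predictive_mean(total_mass, mu_eps, xs)
    return (moved - base) / eps


# Illustrative (hypothetical) inputs.
xs = [0.3, -1.2, 2.5, 0.9]
M, mu_F, mu_G = 2.0, 0.0, 3.0
closed = gateaux_closed_form(M, mu_F, mu_G, len(xs))
numeric = gateaux_finite_difference(M, mu_F, mu_G, xs)
print(closed, numeric)
```

Because the Bayes rule is linear in the prior mean, the difference quotient
agrees with the closed form up to rounding error for any small `eps`.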

This approach is not easily extended to a model where P results from a 
mixture of Dirichlet processes, or if another “nonparametric” prior such as the 
tail-free prior is chosen. I am therefore eager to see the results Diaconis and 
Freedman obtain for the Gâteaux derivative in the undominated case. 

Incidentally, the interpretation of $\dot{T}_\mu$ that follows from Theorem 4(b)
of DFb is particularly appealing: A measure of sensitivity to the prior is
obtained by looking at the ratio of objectivist likelihood to $D(\mu)$. Berger
and others (see Berger, 1984, page 95 for references) have discussed similar
uses of $D(\mu)$ and the likelihood for assessing model adequacy.

Finally, a query: In DFb, Diaconis and Freedman use the past tense in
describing themselves as subjectivist and classical Bayesians, respectively. How 
do they describe themselves now? 


REFERENCES 


BERGER, J. O. (1984). The robust Bayesian viewpoint. In Robustness of Bayesian
Analyses (J. Kadane, ed.). North-Holland, Amsterdam.

BOX, G. E. P. and TIAO, G. C. (1973). Bayesian Inference in Statistical Analysis.
Addison-Wesley, Reading, Mass.

CLAYTON, M. K. (1985). A Bayesian nonparametric sequential test for the mean of a
population. Ann. Statist. 13 1129-1139.

DE FINETTI, B. (1975). Theory of Probability 2. Wiley, New York.

DIACONIS, P. and FREEDMAN, D. (1986a). On inconsistent Bayes estimates of
location. Ann. Statist. 14 68-87.

DIACONIS, P. and FREEDMAN, D. (1986b). On the consistency of Bayes estimates.
Ann. Statist. 14 1-26.

SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley,
New York.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF WISCONSIN 
MADISON, WISCONSIN 53706 


A. P. DAWID 
University College, London 


The mathematical beauty and tractability of the Dirichlet prior render it 
almost irresistibly seductive. But beware! Rocks and shipwreck await the poor 
Bayesian navigator captivated by its siren song. Brown (1976) shone a little light 
on these murky waters. Now Diaconis and Freedman deserve the gratitude of all 
explorers for illuminating some of the more treacherous obstacles to a smooth 
passage. 

Beyond these specific warnings, what broader morals are to be drawn? In view
of the fact that, generically, the pair $(\theta, \mu)$ is inconsistent, it is
not really surprising that the authors can find such a pair. What I find far
more surprising is the existence of priors $\mu$ (e.g., tail-free) which are
consistent at each $\theta$. Perhaps this is only possible because of the rather
weak definition of consistency employed. Nevertheless, it is an important
property, and one which demands further characterization.

Choosing a prior for an infinite-dimensional parameter space is always going to 
be problematical, and any accessible prop (such as consideration of imaginary 
results) should be grabbed. For example, any two different priors are, generically, 
mutually singular, and so involve incompatible world views of what is even 
possible. This can be expected to lead to diverging inferences from the data. The 
mere possibility of consistency, in the problem considered, is therefore an unex- 
pected bonus. 




Diaconis and Freedman have only considered i.i.d. observations with unknown
distribution. Now given any prior, and data $(X_1, X_2, \ldots, X_n)$, we can
construct the predictive distribution for $X_{n+1}$. Consistency implies that,
as $n \to \infty$, the discrepancy between this predictive distribution and the
“true” distribution of $X_{n+1}$ will, in a suitable sense, approach zero. This
property can be extended to apply to much more general models for the data
sequence, involving stochastic dependence and varying marginal distributions,
where it has been termed “prequential consistency” (Dawid, 1984). (Note that the
“counterexample” in Theorem 1 of Diaconis and Freedman (1986) does in fact yield
prequential consistency, and so need not be regarded as especially troubling.
However, the location model with a symmetrised Dirichlet can be prequentially
inconsistent.) In these more
general models, is consistency attainable at all, for sufficiently large parameter 
spaces? For example, can one consistently estimate, prequentially, a process 
known only to be stationary? The arguments of Dawid (1985) strongly suggest 
that, in general, prequential consistency will not be attainable by any method, be 
it Bayesian or not. It would be extremely valuable to characterize problems 
which allow consistency at all. I conjecture that, in any such problem, there will 
exist a consistent Bayesian analysis. 

Diaconis and Freedman seem to imply that their results cast a shadow over 
the use of Bayesian methods, because these can be inconsistent. But so too can 
ill-chosen non-Bayesian methods. Conversely, for the problem considered of 
estimating a distribution, there do exist consistent methods, both classical and 
Bayesian. The moral to be taken away from their analysis is, not that Bayesian 
methods should not be used, but that great care is needed in selecting such a 
method. However, exercise of this care when considering priors less tractable than 
the Dirichlet, whose implications are correspondingly less transparent, is likely to 
pose serious problems of implementation. 


Acknowledgment. I am grateful to Mark Schervish for discussions on these 
points. 

1. Introduction. Diaconis and Freedman (D & F) have made important and
interesting contributions to the problem of determining in which situations
nonparametric Bayes estimates do and do not work. Investigating when
statistical principles and techniques break down is an important enterprise
which is not well enough appreciated. Thus, even though we often experience in
life that nature serves up distributions consistent with “Murphy’s Law,” in
applied research, there is a tendency to believe that nature provides nice
simplistic distributions and models. Economists are starting to realize that
this belief can lead to large errors in prediction. In other fields, it may take
longer to discover similar problems.

We find it both surprising and interesting that inconsistency can occur when
the prior on the location parameter, the Dirichlet parameter in the law of the
error distribution, and the distribution sampled, are all “nice” and symmetric
about zero. D & F (1986a, hereafter I; Section 3) suggest using the “device of
imaginary results” or the “what if” method to deal with the inconsistency. This
procedure involves modifying the prior after computing the posterior for
“imaginary data sequences.” In Section 2 below, we discuss the properties of a
different and much simpler (subjectively speaking) approach which amounts to
computing a posterior distribution based on partial information or to
presmoothing before computing the posterior. In Section 3 we show that this
“partial posterior” idea can be linked to partial likelihood.

But first we will focus on the following intriguing D & F statement: 


Any of the classical estimators, such as the mean or the median will be 
consistent in this situation, so the Bayes estimates do worse than available 
frequentist procedures. (D & F I, Remark 4, Section 1. See also D & F
(1986b; hereafter II), Section 1.)


This statement refers to models where the “Bayes” procedure is given the job
of coping with the infinite-dimensional nuisance parameter $F$ as well as
location while the “frequentist” procedure essentially only has to deal with
location since any nuisance parameter difficulties have been removed by assuming
symmetry. Thus we think that a fairer comparison would be the nonparametric
Bayes procedure versus the semiparametric frequentist procedure where the pair
$(\theta, F)$ is estimated using semiparametric maximum likelihood techniques.

Rather than pursuing this last remark, we claim that the D & F results lead to
the conclusion that what is needed in the nonparametric framework are Bayes
procedures for location that are not distracted by the problem of dealing with
an infinite-dimensional nuisance parameter. Thus we propose using the posterior
distribution of $\theta$ given the trimmed sample mean as the basis for Bayes
procedures for location. More generally, we can condition on any estimate $T$ of
location. In a sense, this corresponds to focusing on location and smoothing to
get rid of distributional unpleasantness that distracts from the location
problem before computing the posterior. For instance, for the counterexample
density $h$ (D & F I, Figure 1), $\bar{X}$ is nearly normal already for
$n = 12$.

In the next section we discuss the consistency, asymptotic normality, and
efficiency of such procedures and find that they have high posterior efficiency.
Moreover, these procedures can be justified on pragmatic grounds. To paraphrase
C. Eisenhart (see Tukey, 1954) “the practical efficiency is the product of the
statistical efficiency of the technique and the probability that the technique
will ever be used.” Our proposed procedure is very simple:
$\mathcal{L}(T \mid \theta)$ is approximately normal; thus with a normal prior,
$\mathcal{L}(\theta \mid T)$ will approximately be the usual normal theory
posterior. Moreover, $\mathcal{L}(\theta \mid T)$ will inherit the robustness
properties of $T$.
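
To make the “usual normal theory posterior” concrete, here is a minimal sketch
of the approximation, assuming $T$ given $\theta$ is treated as exactly
$N(\theta, \sigma^2(F)/n)$ and the prior on $\theta$ is normal. The function
name, the sample, and the plug-in value $\pi/2$ for the asymptotic variance of
the median (its value when $F$ is standard normal) are assumptions made for
illustration.

```python
import math
import statistics


def partial_posterior(t, asymptotic_var, n, prior_mean, prior_var):
    """Normal-normal update treating T | theta as N(theta, asymptotic_var / n).

    Returns the approximate posterior mean and variance of theta given T = t.
    """
    like_prec = n / asymptotic_var      # precision contributed by T
    prior_prec = 1.0 / prior_var        # precision contributed by the prior
    post_var = 1.0 / (like_prec + prior_prec)
    post_mean = post_var * (like_prec * t + prior_prec * prior_mean)
    return post_mean, post_var


# Hypothetical sample with one wild observation; T is the sample median.
sample = [0.8, 1.3, 0.4, 1.1, 150.0, 0.9, 1.6]
t = statistics.median(sample)
post_mean, post_var = partial_posterior(
    t, asymptotic_var=math.pi / 2, n=len(sample), prior_mean=0.0, prior_var=4.0
)
print(t, post_mean, post_var)
```

Since the update sees the data only through the median, the outlier at 150 has
no more effect on the approximate posterior than any other value above the
median, which is the robustness the text attributes to conditioning on $T$.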


2. Robust and consistent Bayes procedures. Let $X_i = \theta + \varepsilon_i$,
$i = 1, \ldots, n$, where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. with
fixed distribution function $F$. Thus given $\theta$, $X_i$ has distribution
$F_\theta(x) = F(x - \theta)$. Let $\theta$ have bounded and continuous prior
density $\pi(\theta)$. We suppose that $T$ is a translation invariant estimate
of $\theta$, i.e., $T(x + c) = T(x) + c$, $c \in (-\infty, \infty)$, and that
$T \to 0$ a.s. $[F]$. Moreover, we assume that
$\mathcal{L}(\sqrt{n}\,T \mid 0) \to N(0, \sigma^2(F))$.


THEOREM 1. Under the above conditions, the posterior distribution of $\theta$
given the partial information provided by $T$ converges to the
$N(T, n^{-1}\sigma^2(F))$ distribution. More precisely, if $\pi(\theta_0) > 0$,
then $\mathcal{L}(\sqrt{n}(\theta - T) \mid T) \to N(0, \sigma^2(F))$
a.s. $[F_{\theta_0}]$.


The proof will appear in a forthcoming paper by the authors. 

Note that F is not assumed to be symmetric. If T is the sample median, the 
conditions will be satisfied if F has median zero. 

For two posterior distributions $\mathcal{L}(\theta \mid T_1)$ and
$\mathcal{L}(\theta \mid T_2)$, it is natural to define the Bayes asymptotic
relative efficiency (BARE) as the ratio
$e_F(T_1, T_2) = \sigma_2^2(F)/\sigma_1^2(F)$ of the posterior variances in
Theorem 1. For instance, if $T_1$ is the normal scores estimate (Hodges and
Lehmann, 1963) and $T_2 = \bar{X}$, then $e_F(T_1, T_2) \ge 1$ for all $F$
symmetric about zero. Thus, in terms of BARE, the normal scores Bayes procedure
is uniformly at least as efficient as the familiar Bayes procedure based on
$\mathcal{L}(\theta \mid \bar{X})$. Similarly, let $T_3$ be an adaptive estimate
of $\theta$ when $F$ is symmetric (Stein, 1956; Stone, 1975). Then modulo proving
strong consistency of $T_3$, $\mathcal{L}(\theta \mid T_3)$ is an adaptive Bayes
posterior with asymptotic posterior variance $\sigma_3^2(F) = 1/nI(F)$, where
$I(F)$ is the Fisher information. Thus, from the point of view of asymptotic
Bayes theory for the location of a symmetric distribution, conditioning on an
adaptive estimate may make more sense than putting a prior on $F$.
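
As a small worked illustration of the BARE ratio, the sketch below compares the
sample median with the sample mean under two symmetric error laws, using the
standard asymptotic variances of $\sqrt{n}\,T$: the variance of $F$ for the mean
and $1/(4f(0)^2)$ for the median. The choice of these two estimates and of the
normal and Laplace laws is an assumption made for the example; it is not the
normal scores comparison discussed in the text.

```python
import math


def bare(var_1, var_2):
    """e_F(T_1, T_2) = sigma_2^2(F) / sigma_1^2(F); values above 1 favour T_1."""
    return var_2 / var_1


def median_asymptotic_var(density_at_zero):
    """Asymptotic variance of sqrt(n) * median when F has median 0 and density f."""
    return 1.0 / (4.0 * density_at_zero ** 2)


# Standard normal errors: variance 1, f(0) = 1 / sqrt(2 * pi).
normal_ratio = bare(median_asymptotic_var(1.0 / math.sqrt(2.0 * math.pi)), 1.0)

# Laplace errors with density exp(-|x|) / 2: variance 2, f(0) = 1 / 2.
laplace_ratio = bare(median_asymptotic_var(0.5), 2.0)

print(normal_ratio, laplace_ratio)
```

Under normal errors the ratio is $2/\pi$, so conditioning on the median costs
efficiency relative to $\bar{X}$; under Laplace errors the ratio is 2, so the
ordering reverses. This is the kind of trade-off the BARE is meant to quantify.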


REMARK 2.1. For the above model, Lo (1984) has shown that
$E(\theta \mid X) \to \theta_0$ a.s. $[F_{\theta_0}]$ and
$\mathcal{L}(\theta \mid X) \to \delta_{\theta_0}$ a.s. $[F_{\theta_0}]$ in the
undominated case, where $\delta_{\theta_0}$ is point mass at $\theta_0$. The
results of D & F show that these results fail to hold if $F_{\theta_0}$ is
replaced by their counterexample distribution $H$ (D & F I, Figure 1).


Next we turn to the model where $F$ is also random with a Dirichlet
distribution $\mathcal{D}(\alpha)$. We want to illustrate that by using a
posterior based on partial information, we can get consistency. To this end, let
$T$ denote the sample median. Since $\alpha/\alpha(R)$ and the counterexample
distribution $H$ (D & F I, Figure 1) both have median zero, then $T \to 0$
a.s. $[\alpha/\alpha(R)]$ and $T \to 0$ a.s. $[H]$. Let
$\alpha_\theta(t) = \alpha((-\infty, t - \theta])/\alpha(R)$ and
$H_\theta(t) = H(t - \theta)$, then $T \to \theta_0$ a.s. $[\alpha_{\theta_0}]$
and $T \to \theta_0$ a.s. $[H_{\theta_0}]$. These results can be used to show:


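THEOREM 2. With the above conventions, if $\pi(\theta_0) > 0$, then
$\mathcal{L}(\theta \mid T) \to \delta_{\theta_0}$ a.s. $[H_{\theta_0}]$. If
$\theta\pi(\theta)$ is bounded, then $E(\theta \mid T) \to \theta_0$
a.s. $[H_{\theta_0}]$.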


The proof will appear in a forthcoming paper by the authors. 

We think that the results of this section in conjunction with those of D & F 
give compelling reasons for flexible Bayesians who desire robust procedures to 
consider posterior distributions based on partial information. It is a matter of 
giving a little (give up Bayes efficiency for some idealized model) in order to gain 
a lot (high Bayes efficiency and consistency over a wide class of models including 
the D & F counterexample distribution). How much is gained (and lost) should be 
the focus of research of pragmatic Bayesians everywhere. 


REMARK 2.2. In the D&F spirit of “true confessions” (D & F I, Section 3), 
what are we anyway? The answer is: Statisticians! Beyond that, one of us is a 
Bayesian, but he will no doubt be excommunicated as a heretic after the above 
remarks. 


3. The history of partial information. Cox (1972, 1975) introduced the
idea of partial likelihood. Kalbfleisch and Prentice (1973) showed that in many
interesting cases, the partial likelihood coincides with the marginal likelihood
and the rank likelihood. The latter is the probability distribution of the rank
vector considered as a function of the parameters and had been used earlier by
Hoeffding (1950) to generate optimal rank procedures. Savage and Saxena (Savage,
1969) proposed using the posterior distribution given the ranks in a
nonparametric Bayesian context. This is also a form of smoothing before
computing the posterior: The ranks map the data set $-150, 6, -2, 3$ into
$-4, 3, -1, 2$. In fact, for a sample from the D & F counterexample distribution
$H$, the signed ranks have a uniform distribution over the space of possible
signed ranks (e.g., Bickel and Doksum, 1977, page 360). Recently, Pettitt (1983)
has proposed useful approximations to the posterior given the ranks.
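
The rank mapping quoted above is the signed-rank transform: rank the absolute
values and reattach the signs. A minimal sketch follows, assuming no ties and no
zeros among the observations; the function name is chosen for illustration.

```python
def signed_ranks(values):
    """Rank |x| from 1 to n (no ties assumed) and reattach the sign of each x."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank if values[idx] > 0 else -rank
    return ranks


print(signed_ranks([-150, 6, -2, 3]))  # [-4, 3, -1, 2]
```

The extreme value $-150$ is replaced by $-4$, so its magnitude no longer
influences anything computed from the signed ranks.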

It is generally acknowledged that it is hard to think about priors in high (but
finite) dimensional spaces. Subjective Bayesians know that it is hard to elicit
a prior from an individual when the dimension is 3 or 4. Diaconis and Freedman’s
interesting results on an inconsistent Bayes rule involving a reasonably natural
prior show how far off our intuition can be when we pass to an infinite
dimensional setting. In this discussion, we present other peculiarities, in
addition to the inconsistent behavior, that arise when one uses the symmetrized
Dirichlet prior. The discussion concludes with a few remarks on an alternative
way of constructing priors on c.d.f.’s.


1. The symmetrized Dirichlet priors. The setup considered by Diaconis
and Freedman is the following:


$X_i = \theta + \varepsilon_i$, $i = 1, 2, \ldots, n$, $\varepsilon_i$ are i.i.d. $\sim F$.


The parameters $\theta$ and $F$ are independent, $\theta$ having a density $f$,
and $F$ being distributed according to $\bar{\mathcal{D}}_\alpha$, with $\alpha$
absolutely continuous.

Let $\theta_{ij} = \tfrac{1}{2}(X_i + X_j)$, and let $\#(\theta_{ij})$ denote
the number of distinct pairs $(X_k, X_l)$ such that
$\tfrac{1}{2}(X_k + X_l) = \theta_{ij}$. (The pairs $(X_k, X_l)$ and
$(X_{k'}, X_{l'})$ are called distinct if the sets $\{X_k, X_l\}$ and
$\{X_{k'}, X_{l'}\}$ are distinct.) The number $\#(\theta_{ij})$ will be called
the multiplicity of $\theta_{ij}$. The posterior distribution of $\theta$ given
$X_1, \ldots, X_n$ is denoted $\pi_n$.

If $\max_{i,j} \#(\theta_{ij}) = 1$, then $\pi_n$ has been described by Diaconis
and Freedman (1986, Lemma 3.1).

Doss (1984) shows that if $\max_{i,j} \#(\theta_{ij}) > 1$, then $\pi_n$ is a
discrete measure, concentrating all its mass on the points $\theta_{ij}$ of
highest multiplicity. (An intuitive explanation of this is in Section 3 of Doss
(1984); the corollary in that section gives an explicit formula for $\pi_n$.) In
particular, if there is a unique point of highest multiplicity, then $\pi_n$ is
a point mass at that point. This qualitative result is true independent of the
parameter $\alpha$.

Several observations can be made immediately from this result; it is easiest to
proceed by way of example. Consider a data set consisting of 104 points, as
follows: $X_1, \ldots, X_{100}$ are between $-1$ and 1, and
$\#(\theta_{ij}) = 1$ for $1 \le i, j \le 100$; the last 4 points are 3, 6, 10,
and 13. Under reasonable conditions on $\alpha$ and $f$ (e.g., $\alpha$ and $f$
are both standard normal), $\pi_{100}$ will have most of its mass between $-1$
and 1. Note, however, that $\pi_{104}$ is a point mass at 8 because 8 has
multiplicity 2. Not only does the posterior undergo a drastic change because of
a few additional observations, but we have a situation where the posterior
($\pi_{104}$) is singular with respect to the prior. To a subjective Bayesian,
this is very counterintuitive: A finite data set completely changes the opinion
of the statistician.
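
The multiplicity count behind this example is easy to reproduce. The sketch
below is illustrative only: it assumes that multiplicities are counted over
unordered pairs of distinct observed values, it uses exact rational arithmetic
so that coincident midpoints are detected reliably, and it replaces the 100
points in $(-1, 1)$ by three hypothetical stand-ins.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def midpoint_multiplicities(xs):
    """Map each midpoint (x_i + x_j)/2, i != j, to the number of pairs producing it."""
    values = sorted({Fraction(x) for x in xs})
    return Counter((a + b) / 2 for a, b in combinations(values, 2))


def highest_multiplicity(xs):
    """Return the largest multiplicity and the midpoints that attain it."""
    counts = midpoint_multiplicities(xs)
    top = max(counts.values())
    return top, sorted(m for m, c in counts.items() if c == top)


# Hypothetical stand-ins for the points between -1 and 1.
inner = [Fraction(-7, 10), Fraction(1, 5), Fraction(9, 10)]

top, where = highest_multiplicity(inner + [3, 6, 10, 13])
print(top, where)  # multiplicity 2, attained only at 8 = (3 + 13)/2 = (6 + 10)/2

# Perturbing one of the last four points slightly removes the coincidence.
top_perturbed, _ = highest_multiplicity(inner + [3, 6, 10, Fraction(13001, 1000)])
print(top_perturbed)
```

With the perturbed point every midpoint has multiplicity 1, which is the
discontinuity in the data discussed next.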

Next, suppose that one of the last four points is perturbed very slightly. The
result is that the posterior is no longer a point mass at 8. Thus, we see that
the map $X \to \pi_n$ is highly discontinuous. (Compare with Diaconis and
Freedman’s discussion of the derivative of the map taking priors to posteriors.)
This is unusual behavior and is disturbing when one considers the possibility of
rounding and/or grouping of the data. In light of the “what if” method discussed
by Diaconis and Freedman, this raises questions about the use of the prior
$\bar{\mathcal{D}}_\alpha$.

Diaconis and Freedman have examined the asymptotics of $\pi_n$ when the data
$X_1, \ldots, X_n$ are i.i.d. from a continuous distribution. The form of the
posterior when $\max_{i,j} \#(\theta_{ij}) > 1$ and the observations made above
raise the question of how $\pi_n$ behaves when $X_1, \ldots, X_n$ are i.i.d.
from a symmetric discrete distribution. The answer is that $\pi_n$ can perform
extremely well.

Consider a discrete distribution $H$ which is symmetric about $\theta$, and
suppose that $H$ has just a finite number of atoms, say at $\theta \pm a_i$ for
$i = 1, \ldots, k$. Let $\{X_i\}$ be i.i.d. $\sim H$. As soon as the values
$\theta \pm a_i$, $i = 1, \ldots, k$ have all been observed, $\theta$ is a
midpoint of multiplicity $k$, while all other midpoints have multiplicity less
than $k$. The posterior $\pi_n$ is then a point mass at $\theta$, and no
additional observations can change it. Furthermore, the probability that at
least one of the values $\theta \pm a_i$ has not been observed by time $n$ goes
down exponentially with $n$. This can be generalized to other types of discrete
symmetric distributions. For example, the result is still true (with probability
1) if $H$ is picked according to $\bar{\mathcal{D}}_\alpha$; see Doss (1984).
Thus, the posterior (and its mean) can behave extremely well if the data come
from a discrete distribution.

Suppose that $X_1, X_2, \ldots, X_n$ have a unique midpoint $\theta^*$ of
highest multiplicity. Since the posterior distribution of $\theta$ is a point
mass at $\theta^*$ regardless of the parameter $\alpha$, it follows that if the
prior on $F$ is a mixture of symmetrized Dirichlet priors, then the posterior
distribution of $\theta$ is still a point mass at $\theta^*$. Thus, the entire
discussion above applies to mixtures of symmetrized Dirichlet priors. Dalal (see,
e.g., 1979) has shown that such mixtures are dense in the set of all priors on
symmetric $F$’s, with the weak topology. These considerations raise questions as
to the appropriateness of approximating priors on symmetric $F$’s with mixtures
of symmetrized Dirichlet priors, since it is obvious that it is the “details” of
these mixtures (viz. the fact that they give probability 1 to discrete symmetric
c.d.f.’s) that are causing the peculiar behavior.


2. An alternative construction of random c.d.f.’s. The atoms in the
posterior distribution of $\theta$ arise because the construction of random
c.d.f.’s used by Diaconis and Freedman produces symmetric discrete c.d.f.’s with
probability 1. The atoms vanish if instead we proceed as follows. Let $\alpha$
be a finite symmetric (absolutely continuous) measure on $R$, and let
$\alpha_+$ and $\alpha_-$ denote the restrictions of $\alpha$ to $(0, \infty)$
and $(-\infty, 0)$, respectively. Choose $F_+$ and $F_-$ independently from
$\mathcal{D}_{\alpha_+}$ and $\mathcal{D}_{\alpha_-}$, respectively, and form


$$F(t) = \tfrac{1}{2}F_+(t) + \tfrac{1}{2}F_-(t).$$


This $F$ has median equal to 0 and with probability 1 is not symmetric, although
it is symmetric “on the average,” e.g.,

$$EF(t) = \frac{\alpha((-\infty, t])}{\|\alpha\|}.$$

If we use this construction, the problem becomes one of estimating the median.
This is done in Doss (1985a, b), which contain the details of the rest of the
discussion. The Bayes estimate $\hat{\theta}_n$ of $\theta$ can be described as
follows. Let $\hat{\theta}_n^{ML}$ denote the maximum likelihood estimate of
$\theta$ when the distribution of the $\varepsilon_i$’s is assumed to be
$\alpha/\|\alpha\|$. Then, $\hat{\theta}_n$ is essentially a convex combination
of $\hat{\theta}_n^{ML}$ and of the sample median, with the weights depending on
the sample.

Suppose that $\{X_i\}$ are i.i.d. from a distribution function $H$ with unique
median equal to $\theta$. If $H$ is discrete, then the Bayes estimate
$\hat{\theta}_n$ is consistent. However, if $H$ is continuous, then the Bayes
estimate can be inconsistent: It can converge to a wrong value, it can oscillate
indefinitely between two wrong values, or the set $\{\hat{\theta}_n\}$ can be
dense in $R$. As before, this behavior can be traced to the fact that the
Dirichlet priors give probability 1 to discrete distributions.

Consider the following simplified version of Freedman’s (1963) example showing
inconsistency of the posterior distribution for countable parameters.

Let $\theta_\infty$ denote the geometric distribution

$$p(i) = 3/4^{i+1}, \quad i = 0, 1, \ldots.$$

Let $\theta_k$ denote the truncated geometric distribution

$$p(i) = 3/4^{i+1}, \quad i = 0, 1, \ldots, k,$$
$$p(i) = 0, \quad i = k + 1, \ldots,$$
$$p(i) = 4^{-(k+1)}, \quad i = -1.$$

Let $\theta_0$ denote the geometric distribution

$$p(i) = 3^i/4^{i+1}, \quad i = 0, 1, \ldots.$$

A prior distribution gives probability $\Pi_k$ to $\theta_k$. The likelihood of
$X_1, X_2, \ldots, X_n$ at $\theta_k$ is $(3/4)^n (1/4)^{S_n}$ for $0 < k$,
$k \ge \max X_i$ and is $(1/4)^n (3/4)^{S_n}$ at $k = 0$, where
$S_n = \sum X_i$.

The posterior probability of $\theta_0$ is

$$\Pi_0 3^{S_n} \Big/ \Big( \Pi_0 3^{S_n} + 3^n \sum_{k \ge \max X_i} \Pi_k \Big).$$

When $\theta_\infty$ is true, $S_n = \tfrac{1}{3}n + O(\sqrt{n})$,
$\max X_i \sim \log_4 n$. Let $Q_k = \sum_{j \ge k} \Pi_j$. Choose the prior
$\Pi$ so that $\Pi_k > 0$ all $k$, $3^n Q(\log_4 n) \to 0$ as $n \to \infty$.
Thus $\theta_\infty$ is true, and every neighborhood of $\theta_\infty$ has
positive probability in the weak-star topology, yet the posterior probability of
$\theta_0$ converges to 1.
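
The mechanism can be illustrated numerically on the log scale. The sketch below
is a hedged illustration, not part of the original argument: it assumes the
particular prior $\Pi_0 \propto 1$, $\Pi_k \propto 3^{-16^k}$ for $k \ge 1$ (one
choice whose tails are thin enough that $3^n Q(\log_4 n) \to 0$), it assumes all
observations are nonnegative, and the function names are hypothetical. It
returns the log of the ratio $\Pi_0 3^{S_n} / (3^n \sum_{k \ge \max X_i} \Pi_k)$,
in which the normalising constant of the prior cancels.

```python
import math
import random

LOG3 = math.log(3.0)


def log_prior_weight(k):
    """Unnormalised log prior: Pi_0 prop. to 1, Pi_k prop. to 3^(-16^k) for k >= 1."""
    return 0.0 if k == 0 else -(16.0 ** k) * LOG3


def log_tail_mass(m, terms=4):
    """log of sum_{k >= m} Pi_k (unnormalised); terms beyond the first few are negligible."""
    logs = [log_prior_weight(k) for k in range(m, m + terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def log_posterior_odds_theta0(xs):
    """log of P(theta_0 | x) / P(truncated family | x) for nonnegative data xs."""
    n, s = len(xs), sum(xs)
    m = max(max(xs), 1)  # only theta_k with k >= max X_i (and k >= 1) remain possible
    return log_prior_weight(0) + s * LOG3 - n * LOG3 - log_tail_mass(m)


def sample_theta_inf(n, rng):
    """Draw n values with P(X = i) = 3 / 4^(i+1), i = 0, 1, ..."""
    draws = []
    for _ in range(n):
        i = 0
        while rng.random() < 0.25:
            i += 1
        draws.append(i)
    return draws


rng = random.Random(0)
for n in (50, 500, 5000):
    xs = sample_theta_inf(n, rng)
    print(n, max(xs), log_posterior_odds_theta0(xs))
```

Positive log odds mean that the posterior favours $\theta_0$ even though the
data were generated from $\theta_\infty$; with this prior the prior tail mass
beyond $\max X_i$ shrinks much faster than the likelihood ratio $3^{n - S_n}$
grows.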

From this and other arguments the authors conclude that posteriors are 
usually inconsistent when the parameters are countably dimensional. But note 
that the parameter space here is just the integers $k = 0, 1, \ldots, \infty$ (and in
Freedman’s example it is a closed interval). The counterexample applies equally 
well in finite dimensional cases if one induces a topology on the parameters by 
the weak-star topology on distributions! 

The weak-star topology gives little weight to the tails, but small differences in 
the tails can have very large relative effects on the likelihood, and therefore on 
the posterior probabilities. I wonder if it wouldn’t be possible to escape from the 
inconsistency by using a likelihood-friendly topology such as one based on the 
distance 

ia P, 
p(P,Q)= X} (P,- Q)? 
1m0 t 
In this topology, the $\theta_k$ of the example are isolated points, and
$\theta_\infty$ has zero probability, so there is no surprise in finding the
posterior inconsistent.

1. Introduction. Although hardly the authors’ intention, these papers by
Diaconis and Freedman (D & F) will probably be read by many as criticism
against and pessimism about Bayesian analysis in situations with
high-dimensional parameter spaces. It is perhaps also easy for the statistician
scanning the papers to get an impression of “just counterexamples,” which would
be unfair; these and earlier papers by D & F (or F & D) contain many new
important statistical ideas and also useful mathematical techniques.

I will try to be (more) positive and hope to show that thinking Bayes in semi-
and nonparametric models may be a worthwhile enterprise, sometimes giving
additional insight into old problems, and sometimes (dare I say often?) leading
to sensible Bayes procedures that also behave agreeably in the frequentist
asymptotic sense. The bulk of my comments concerns a problem that is almost as
old as statistics itself, that of fitting a parametric model to a data set, and
that can be attacked again with ideas underlying some of the constructions of
D & F. Let $X_1, \ldots, X_n$ be a sample from some unknown distribution $F$
with density $f$. Some (possibly crude) parametric family
$\{F_\theta, f_\theta : \theta \in \Theta\}$ is then forced on the data.
Textbooks teach us how to proceed, for example, advocating finding the maximum
likelihood estimator $\hat{\theta}_{ML}$ on the grounds of good asymptotic
behavior, in particular, consistency. What very few textbooks tell us, however,
is what $\hat{\theta}_{ML}$ does when the model is wrong, i.e., there is no
$\theta_0$ with $f = f_{\theta_0}$. It is however not difficult to see that
$\hat{\theta}_{ML}$ still is a meaningful estimator in that it takes aim at the
parameter value $\theta = \theta_0$ that minimises Kullback-Leibler
“information distance”


(1) $K(f : f_\theta) = \int f \log(f/f_\theta)\,dx$;


the log likelihood divided by $n$ is a consistent estimate of
$\int f \log f\,dx - K(f : f_\theta)$. Under appropriate conditions
$\hat{\theta}_{ML}$ is consistent for this “least false” parameter value. Hjort
(1985a, Chapter 3) has further comments about the behavior of maximum likelihood
machinery when the model is wrong.
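
A small numerical illustration of a “least false” parameter may help. The
sketch below uses a discrete analogue of (1), with sums in place of integrals,
and fits a Poisson model to a four-point distribution that is not Poisson; the
distribution, the grid, and the function name are all assumptions made for the
example. Minimising the Kullback-Leibler distance over $\lambda$ is the same as
minimising the cross-entropy term, and for the Poisson family the minimiser is
the mean of the true distribution.

```python
import math


def cross_entropy_poisson(true_pmf, lam):
    """-sum_i p_i log f_lambda(i); equals K(f : f_lambda) plus a constant free of lambda."""
    return -sum(
        p * (i * math.log(lam) - lam - math.lgamma(i + 1))
        for i, p in enumerate(true_pmf)
    )


# Hypothetical "true" distribution on {0, 1, 2, 3}; it is not a Poisson law.
true_pmf = [0.5, 0.1, 0.1, 0.3]

# Grid search for the least false Poisson rate.
grid = [j / 100 for j in range(1, 501)]
least_false = min(grid, key=lambda lam: cross_entropy_poisson(true_pmf, lam))
true_mean = sum(i * p for i, p in enumerate(true_pmf))

print(least_false, true_mean)
```

The grid minimiser coincides with the mean of the true distribution, so in this
example the maximum likelihood estimate (the sample mean) aims at a well-defined
target even though no Poisson law generated the data.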

One of the major uses of a fitted model is prediction, or probability
assessments, for certain sets. Thus we could be interested in stating that
approximately 90% of future $X$’s from a fitted normal will fall in
$(\hat{\mu} - 1.645\hat{\sigma}, \hat{\mu} + 1.645\hat{\sigma})$, or that
approximately 50% of future data points from a fitted Weibull fall below
$\hat{\theta}(\log 2)^{1/\hat{\gamma}}$, etc. If such statements are an
important part of the statistical analysis, then there are disadvantages to
using $\hat{\mu}_{ML}, \hat{\sigma}_{ML}$, resp.
$\hat{\theta}_{ML}, \hat{\gamma}_{ML}$, in the case of an incorrectly specified
model, and one could do better with other estimates that aimed at other versions
of least false population parameters. It is the aim of the present notes to show
that such least false parameters can be defined and that a suitably engineered
semiparametric Bayesian setup can result in estimates that actually manage to
estimate these.


¹ This work was done while the author visited Stanford University with grants from the Norwegian
Computing Center and the Royal Norwegian Council for Scientific and Industrial Research. 

If we acknowledge uncertainty about the chosen parametric model we should
perhaps build that into a larger statistical model. A natural Bayesian approach
is to give a prior density $\nu(\theta)\,d\theta$ for $\theta$ in $\Theta$ and
some prior on the space of distributions on the sample space, centered at
$\{f_\theta : \theta \in \Theta\}$ in some sense. Such ideas are really behind
much of the work presented in D & F. One way of doing this is the following:
Assume for the moment that the $X_j$’s are univariate and write

(2) $X_j = F_\theta^{-1}(U_j), \quad j = 1, \ldots, n,$

where $U_j$ has distribution $G = FF_\theta^{-1}$. That the parametric model is
correct amounts to having $G = G_0 = U(0, 1)$ for some $\theta$. Uncertainty
about the parametric model can therefore be modelled by a prior distribution for
$G$ (in the space of distributions on $[0, 1]$), centered at $G_0$. An
“uncertain Gaussian model,” for example, is


(3) $X_j = \mu + \sigma Y_j, \quad j = 1, \ldots, n,$


where $Y_j = \Phi^{-1}(U_j)$ has a random distribution centered at the standard
normal. To help identifiability one could restrict the space of allowable
distributions for $Y_j$ to those having zero mean and unit variance, or to those
being symmetric with interquartile range $2 \times (0.674)$, etc. The latter
approach would be along the lines of D & F.

The next section outlines another but related approach, still with the notori- 
ous (“herostratic” would be too harsh, even with D & F’s examples) Dirichlet 
process prior as a building tool, and is more akin to recent work of Hani Doss 
(1985a, b). The method offers the possibility of building uncertainty about any 
parametric model into a larger semiparametric model, and allows one to specify 
control sets that may be important for later predictions based on the fitted 
model. The asymptotic results of Section 3 are of the same character as those of 
D&F and of Doss, and indeed, examples displaying “inconsistency” can be 
constructed. They are interpreted in a more positive light here, however. It will 
be seen that the resulting Bayes estimates really take aim at, and will be 
frequentist consistent for, completely sensible least false parameter values. Also 
included in Section 3 are indications of asymptotic normality results and calcula- 
tions of influence functions. It emerges that the Bayes estimates, or for that 
matter closely related frequentist estimates, constitute robust alternatives to 
traditional estimates, with the advantageous capability of being flexibly tailored 
to any specific prediction task, and without losing much efficiency in the idealised 
(and unrealistic) case when the parametric model happens to be correct. 

Section 4 briefly sketches some ideas for similar Bayesian semiparametric 
analysis in parametric survival analysis models, where the class of beta processes 
plays the natural role. Section 5 contains additional remarks. 


2. Semiparametric Bayes estimation. Having (2) and (3) in mind, write
for a general i.i.d. sample $X_1, \ldots, X_n$ in some $\mathcal{X}$ space


(4) $X_j = h_\theta(Y_j), \quad j = 1, \ldots, n,$

where $h_\theta$ is one-to-one on some $\mathcal{Y}$ space. $Y_j$ has
distribution $G$; if $G$ is equal to some idealised $G_0$ then $X_j$ has
distribution $F_\theta = G_0 h_\theta^{-1}$ with density
$f_\theta(x) = g_0(h_\theta^{-1}(x))\,|\partial h_\theta^{-1}(x)/\partial x|$,
say.

A first construction for a prior distribution for $(\theta, G)$ could be to let
$\theta \sim \nu(\theta)\,d\theta$ and $G$, independently, be a Dirichlet process
centered at $G_0$. Write $G \sim \mathrm{Dir}(kG_0)$ for such a process, with
“strength of belief” parameter $k$ and “prior guess” $G_0$. It turns out that
$\theta$ has posterior density


(5) $\nu(\theta \mid x) = c(x) \prod_{j=1}^{n}{}^{*} f_\theta(x_j)\,\nu(\theta),$


the $*$ signifying that only the distinct observations are to be included. The
only effect of the sophisticated extra randomness introduced by
$G \sim \mathrm{Dir}(kG_0)$ is that the likelihood is only over the distinct data
points. ($c(x)$ denotes generically a function of the data
$x = (x_1, \ldots, x_n)$ that gives integrated posterior density 1.)

A more fruitful approach is the following, generalising Doss’ method
(1985a, b). Define $m$ control sets $B_1, \ldots, B_m$, constituting a
measurable partition of $\mathcal{Y}$ space, with $G_0(B_i) = z_i$,
$i = 1, \ldots, m$. Now pin down a $G \sim \mathrm{Dir}(kG_0)$ by conditioning on
$G(B_i) = z_i$, $i = 1, \ldots, m$. It can be seen that $G$ splits into $m$
separate and independent Dirichlet processes:


$G = z_i G_i$ on the set $B_i$, where $G_i \sim \mathrm{Dir}(kz_i(G_0/z_i))$ on $B_i$.

Hjort (1985b) obtains


(6) $\nu(\theta \mid x) = c(x) M(x, \theta) \prod_{j=1}^{n} f_\theta(x_j)\,\nu(\theta)$

for the posterior density of $\theta$, where

(7) $M(x, \theta) = \prod_{i=1}^{m} z_i^{C_i(\theta)} / \Gamma(kz_i + C_i(\theta))$

and

(8) $C_i(\theta) = \#\{j : x_j = h_\theta(y_j) \in h_\theta B_i\} = nF_n(h_\theta B_i),$


writing $F_n$ for the usual empirical distribution of the sample. $M(x, \theta)$
is large for values of $\theta$ that make $C_i(\theta)$ close to $nz_i$, i.e.,
$F_n(h_\theta B_i)$ close to $z_i$, $i = 1, \ldots, m$.

EXAMPLE. Fit a normal $(\mu, \sigma^2)$ to data. Assume it is of interest to have
approximately 25% of future data points in each of the four categories
$(-\infty, \hat\mu - c\hat\sigma]$, $(\hat\mu - c\hat\sigma, \hat\mu]$, $(\hat\mu, \hat\mu + c\hat\sigma]$, and $(\hat\mu + c\hat\sigma, \infty)$, for c = 0.674, which would be the
case for each perfectly normal underlying F, but not, for example, for even
slightly skewed F, if $\hat\mu_{ML}$, $\hat\sigma_{ML}$ are used. The 25-25-25-25 goal could be important
for prediction purposes, or just considered a pleasant aspect of the theoretical
parametric model worth preserving to some extent for the fitted model. Define
control sets $B_1 = (-\infty, -c]$, $B_2 = (-c, 0]$, $B_3 = (0, c]$, and $B_4 = (c, \infty)$. Then


$M(x; \mu, \sigma) = \prod_{i=1}^{4} [\Gamma(\tfrac14 k + C_i(\mu, \sigma))]^{-1}$

(the constant factor $4^{-n}$ from (7) being absorbed into $c(x)$), where

$C_1(\mu, \sigma) = nF_n(-\infty, \mu - c\sigma]$, $C_2(\mu, \sigma) = nF_n(\mu - c\sigma, \mu]$,
$C_3(\mu, \sigma) = nF_n(\mu, \mu + c\sigma]$, $C_4(\mu, \sigma) = nF_n(\mu + c\sigma, \infty)$.


The posterior density (6) would in this case have two peaks, one corresponding to
$M(x; \mu, \sigma)$, trying to achieve the 25-25-25-25 splitting, and one for the usual
factor $\prod_i f_{\mu,\sigma}(x_i)$, which makes efforts to get the population mean and population
variance correctly estimated. Since these goals coincide only in the idealised
Gaussian case, the Bayes estimators based on (6) try to push $\hat\mu_{ML}$, $\hat\sigma_{ML}$ so as to
better achieve the stated 25-25-25-25 goal.
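
As a rough illustration of how the two factors in (6) interact in this example, the following Python sketch evaluates the cell counts, log M, and the unnormalised log posterior. The flat prior on $(\mu, \sigma)$, the function names, and the toy data are assumptions made for illustration only; they are not taken from Hjort (1985b).

```python
import math

C = 0.674  # quartile constant c from the example


def cell_counts(xs, mu, sigma, c=C):
    """Counts C_1..C_4 of observations falling in the four control cells."""
    cuts = (mu - c * sigma, mu, mu + c * sigma)
    counts = [0, 0, 0, 0]
    for x in xs:
        # number of cut points strictly below x gives the cell index
        counts[sum(1 for cut in cuts if x > cut)] += 1
    return counts


def log_M(xs, mu, sigma, k):
    """log M(x; mu, sigma) up to an additive constant (requires k > 0)."""
    return -sum(math.lgamma(k / 4 + ci) for ci in cell_counts(xs, mu, sigma))


def log_posterior(xs, mu, sigma, k):
    """Unnormalised log of (6), assuming a flat prior on (mu, sigma)."""
    loglik = sum(
        -0.5 * math.log(2 * math.pi) - math.log(sigma)
        - 0.5 * ((x - mu) / sigma) ** 2
        for x in xs
    )
    return log_M(xs, mu, sigma, k) + loglik


data = [-1.2, -0.3, 0.1, 0.4, 1.5]  # toy data, hypothetical
print(cell_counts(data, 0.0, 1.0))
print(log_M(data, 0.0, 1.0, k=4.0), log_M(data, 5.0, 1.0, k=4.0))
```

A balanced split of the sample across the four cells gives a larger log M than a lopsided one, which is the sense in which M "pulls" the posterior toward the 25-25-25-25 goal.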

The theory allows multidimensional data, and extends to non-i.i.d. situations, 
for example regression models. Further examples are in Hjort (1985b). 


3. Asymptotic behaviour of the estimates. Assume that the $X_i$’s come
from a continuous F with density f. Until D & F came along one would have
expected the parts of the posterior density (6) that stem from the fixed, chosen
prior distribution to be washed out by the data as n tends to infinity. $\nu(\theta)$ indeed
ceases to be important even for moderate n, inviting subject-independent
Jeffreys-Box-Tiao-style choices for this parametric part, but the nonparametric
part $M(x, \theta)$ turns out to match the Fisherian part $\prod_{i=1}^{n} f_\theta(x_i)$ in importance.

A Stirling approximation shows that


$\prod_{i=1}^{m} z_i^{np_i} / \Gamma(k z_i + n p_i) \doteq B_n\, e^{-nI[p:z]} \prod_{i=1}^{m} p_i^{-(k z_i - 1/2)},$


where $B_n = (2\pi)^{-m/2} \exp\{n - (n + k - \tfrac12 m)\log n\}$ is independent of the probability
vectors $p = (p_1, \ldots, p_m)$ and $z = (z_1, \ldots, z_m)$. Here $I[p:z] =
\sum_{i=1}^{m} p_i \log(p_i/z_i)$ is the Kullback-Leibler distance from p to z, cf. (1), and is
convex in p with a unique minimum at p = z. From (6) and (7) we get the
approximation

(9) $\nu(\theta \mid x) \doteq c(x)\, e^{-nQ(F_n, \theta)}$

for the posterior density, ignoring some lower-order terms, where

(10) $Q(F_n, \theta) = \sum_{i=1}^{m} F_n(h_\theta B_i) \log(F_n(h_\theta B_i)/z_i) - \frac{1}{n} \sum_{j=1}^{n} \log f_\theta(x_j).$

The posterior density is concentrated where $Q(F_n, \theta)$ is smallest, and the Bayes
estimator $\hat\theta$ should asymptotically behave as

(11) $\tilde\theta = \phi(F_n)$ = the $\theta$ minimising $Q(F_n, \theta)$.

Under regularity conditions, therefore, both $\hat\theta$ and $\tilde\theta$ are frequentist consistent
estimators for

(12) $\theta_1 = \phi(F)$ = the $\theta$ minimising $Q(F, \theta)$,

where

(13) $Q(F, \theta) = \sum_{i=1}^{m} F(h_\theta B_i) \log(F(h_\theta B_i)/z_i) - \int \log f_\theta(x)\, dF(x)$
$\qquad = I[(F h_\theta B_i)_{i=1}^{m} : (z_i)_{i=1}^{m}] + I[F : F_\theta] - \int f \log f\, dx.$


$\theta_1$ enjoys interpretation as a least false parameter, and lies intuitively somewhere
between the $\theta_0$ that $\hat\theta_{ML}$ aims at, discussed around (1), and a third variant $\theta_2$
that minimises the first term in $Q(F, \theta)$, aiming at getting $(F h_\theta B_i)_{i=1}^{m}$ as close to
$(z_i)_{i=1}^{m}$ as possible.
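
The Stirling approximation used above can be checked numerically. The sketch below compares the exact logarithm of $\prod_i z_i^{np_i}/\Gamma(kz_i + np_i)$ with the logarithm of the approximating expression; the particular values of n, k, p, and z are arbitrary choices for illustration, and all $p_i$ are assumed strictly positive.

```python
import math


def kl(p, z):
    """Kullback-Leibler distance I[p:z] = sum p_i log(p_i / z_i)."""
    return sum(pi * math.log(pi / zi) for pi, zi in zip(p, z) if pi > 0)


def log_exact(n, k, p, z):
    """log of prod_i z_i^(n p_i) / Gamma(k z_i + n p_i)."""
    return sum(
        n * pi * math.log(zi) - math.lgamma(k * zi + n * pi)
        for pi, zi in zip(p, z)
    )


def log_stirling(n, k, p, z):
    """log of B_n * exp(-n I[p:z]) * prod_i p_i^(-(k z_i - 1/2))."""
    m = len(p)
    log_Bn = -(m / 2) * math.log(2 * math.pi) + n - (n + k - m / 2) * math.log(n)
    correction = -sum((k * zi - 0.5) * math.log(pi) for pi, zi in zip(p, z))
    return log_Bn - n * kl(p, z) + correction


p = [0.2, 0.3, 0.5]    # illustrative cell proportions
z = [0.25, 0.25, 0.5]  # illustrative prior cell probabilities
gap = log_exact(4000, 2.0, p, z) - log_stirling(4000, 2.0, p, z)
print(gap)  # small when every n * p_i is large
```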

When f really belongs to the parametric family, say $f = f_\theta$, then $\theta_0$, $\theta_1$, and $\theta_2$
all coincide with the (then true) value $\theta$.

When F is discrete, only o(n) of $X_1, \ldots, X_n$ are distinct, a.s., and the first
term in $Q(F_n, \theta)$ dominates. Hence $\hat\theta$ is consistent for $\theta_2$ in this case!

Hjort (1985b) gives conditions under which $n^{1/2}(\hat\theta - \tilde\theta) \to 0$ in probability.
Limiting properties of $n^{1/2}(\hat\theta - \theta_1)$ can therefore be investigated by studying the
functional $\phi$ above. Its influence function can be calculated. Preliminary work
indicates that such influence functions are of the robust type (more cautious than for $\hat\theta_{ML}$ in typical
models) and that a reasonable efficiency is retained in the idealised case $f = f_\theta$.
To cite but one example, $\hat\mu$, $\hat\sigma$ constructed as in the example of Section 2 have


$\begin{pmatrix} n^{1/2}(\hat\mu - \mu) \\ n^{1/2}(\hat\sigma - \sigma) \end{pmatrix} \to_d N_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} (1.071\sigma)^2 & 0 \\ 0 & (0.866\sigma)^2 \end{pmatrix} \right)$

in the idealised Gaussian case. This compares well with the optimal covariance
matrix $\mathrm{diag}(\sigma^2, \sigma^2/2)$.
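
To make the functional in (11) concrete, here is a minimal Python sketch that minimises $Q(F_n, \theta)$ by brute-force grid search in the normal example of Section 2, with $z_i = 1/4$ and c = 0.674. The grid, the grid search itself, and the artificial "perfectly normal" sample built from normal quantiles are assumptions for illustration; no claim is made that this is how the estimates were computed in the cited work.

```python
import math
from statistics import NormalDist

C = 0.674
Z = (0.25, 0.25, 0.25, 0.25)


def q_criterion(xs, mu, sigma):
    """Q(F_n, theta) of (10) for the normal model with four control cells."""
    n = len(xs)
    cuts = (mu - C * sigma, mu, mu + C * sigma)
    counts = [0, 0, 0, 0]
    loglik = 0.0
    for x in xs:
        counts[sum(1 for cut in cuts if x > cut)] += 1
        loglik += (-0.5 * math.log(2 * math.pi) - math.log(sigma)
                   - 0.5 * ((x - mu) / sigma) ** 2)
    # empty cells contribute 0 to the Kullback-Leibler term
    cell_term = sum((ci / n) * math.log((ci / n) / zi)
                    for ci, zi in zip(counts, Z) if ci > 0)
    return cell_term - loglik / n


def minimise_q(xs, mus, sigmas):
    """Grid-search minimiser of Q(F_n, .), returned as (mu, sigma)."""
    return min(((m, s) for m in mus for s in sigmas),
               key=lambda t: q_criterion(xs, t[0], t[1]))


n = 200
sample = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
mus = [-0.5 + 0.05 * i for i in range(21)]
sigmas = [0.5 + 0.05 * i for i in range(21)]
print(minimise_q(sample, mus, sigmas))
```

For data that look exactly normal, both terms of Q are minimised near the same point, in line with the remark that the least false parameters coincide in the idealised case.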


4. Semiparametric Bayesian analysis of survival analysis models. We
still need more experience in and knowledge of the consequences of Bayesian
analysis of semi- and nonparametric models. Frequentist asymptotic analysis
seems to have been restricted to cases where the Dirichlet process, in various
disguises, has been the prior. Another testing ground could be models in survival
analysis with censored data, where manageable prior processes other than the
Dirichlet are available (cf. Ferguson and Phadia (1979) and Hjort (1984a)). (The
classical results about maximum likelihood and Bayes analysis for i.i.d. frameworks
with finitely many parameters carry over to, say, counting process models
with censoring; see Hjort (1984b).)

Beta processes are introduced in Hjort (1984, 1985c) as natural priors for
cumulative hazard rates in nonparametric models with censoring. Suppose for
example that a crude model specifies a constant rate $\theta$ for transitions from state
s to state s’ in a (possibly time-inhomogeneous) Markov chain. A semiparametric
supermodel could structure the underlying unknown cumulative hazard A(t) via
$1 - dA(s) = (1 - dB(s))^{\theta}$, where $B(\cdot)$ is a beta process centered at $B_0(\cdot)$, $B_0(t)
= t$; there is also a strength of belief parameter function $k(\cdot)$ to be specified. If $\theta$
is also given a prior the posterior density of $\theta$ can be worked out. Preliminary
investigation indicates that the Bayes estimate $\hat\theta$ converges to a value that
depends on the chosen function $k(\cdot)$; this is in contrast to results for the method
outlined in Sections 2 and 3.

As another example, consider Cox’s regression model. Imagine individual i
having its own cumulative hazard $A_i(\cdot)$, and assume proportional hazards
$1 - dA_i(s) = (1 - dA(s))^{\exp(\beta' z_i)}$, where $z_i$ is the covariate vector, and $A(\cdot)$ is
close to having unit rate A(t) = t. As above, a prior on $(\beta, \theta)$ can be given, and
A can be taken as a beta process centered at $A_\theta$ with strength of belief parameter
$k(\cdot)$. The posterior density of $\beta$ can be handled. The Bayes estimate $\hat\beta$ is close to
the usual Cox estimate for $k(\cdot)$ close to zero, and is close to the maximum
likelihood estimate based on the $(\beta, \theta)$ model with $A = A_\theta$ when $k(\cdot)$ is large.
The asymptotic fate of $\hat\beta$ is unclear for intermediate choices of $k(\cdot)$. I hope to
pursue these matters later.


5. Additional remarks. 

(a) D & F state that the Bayes estimates do worse than available frequentist
procedures, e.g., the empirical median $M_n$ (D & F (1986), Section 1, Remark 4).
Bayes procedures that match $M_n$ in performance can be constructed, however, if
the problem is just this, i.e., estimating the true median. Let $\hat\theta_k$ be the Bayes
estimate (posterior expectation) based on any Dirichlet process prior $\mathrm{Dir}(kF_0)$.
Then $\hat\theta_k$ is close to the interesting estimator $\theta^* = \sum_{i=1}^{n} \binom{n-1}{i-1} (\tfrac12)^{n-1} x_{(i)}$, assuming
$x_{(1)} < \cdots < x_{(n)}$: $\hat\theta_k - \theta^* \to_p 0$ if $k/n \to 0$, and $n^{1/2}(\hat\theta_k - \theta^*) \to_p 0$ if $k/n^{1/2} \to
0$. Also, $n^{1/2}(\theta^* - M_n) \to_p 0$. These statements are valid with some restrictions
on the tails of the underlying continuous F.
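
The estimator $\theta^*$ in remark (a) is a binomially weighted average of the order statistics. A minimal Python sketch follows; the function name `theta_star` and the example inputs are illustrative assumptions.

```python
import math


def theta_star(xs):
    """Sum over i of C(n-1, i-1) (1/2)^(n-1) x_(i), with x_(i) the order statistics."""
    ordered = sorted(xs)
    n = len(ordered)
    # enumerate starts at 0, so index i here plays the role of i - 1 in the formula
    return sum(math.comb(n - 1, i) * 0.5 ** (n - 1) * x
               for i, x in enumerate(ordered))


print(theta_star([3.0, 1.0, 2.0]))             # weights 1/4, 1/2, 1/4
print(theta_star([0.0, 0.0, 0.0, 0.0, 10.0]))  # the outlier gets weight 1/16
```

The weights concentrate on order statistics within about $n^{1/2}$ ranks of the middle, which is why $\theta^*$ stays close to the empirical median.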

(b) The work of D&F, and the present contribution, can be seen as an 
attempt to construct Bayesian robust procedures qualitatively similar to those 
recently worked out by Beran (1981), Millar (1981), and others: full efficiency at 
the parametric model and Le Cam-type robust optimality in a (shrinking) 
neighbourhood. The construction of Sections 2 and 3 above seems to manage this
only in the not very satisfactory asymptotic framework where $k/n \to \infty$, k being
the prior sample size parameter, cf. (6), where $M(x, \theta)$ is dominated by $\prod_{i=1}^{n} f_\theta(x_i)$
under this assumption.

(c) The semiparametric Bayes estimates constructed in Sections 2 and 3 have 
frequentist relatives that behave equally well (?) asymptotically. The parameter 
k must be specified by the user in (6) in order to compute the Bayes estimate; 
however (i) the asymptotics are independent of k, and (ii) it would also be 
possible to estimate k from the data. k large means a good fit to the parametric 
model. 

(d) I welcome papers such as these (D & F) and Rubin (1984), discussing points 
of overlap and of mutual interest for Bayesianism and frequentism. I agree with 
Diaconis (1985) when he observes that the controversy seems to have lost its 
power to polarise. I also agree with a slight variation of another statement in 
Diaconis (1985): We should focus on the coming controversy—with those who 
think the computer has taken over. 
WILLIAM S. KRASKER AND JOHN W. PRATT
Harvard University 


This is one in a series of fascinating papers. They are easily read as painting a 
picture of modern Bayesianism in bad trouble but frequentism in fine shape. A 
larger historical perspective gives a different view, however. It should therefore 
be sketched, even if only impressionistically. 

Time began in darkness and “inverse” probability. Then the Rev. Thomas 
Bayes let in some posthumous light. The postulate he identified and used was 
duly found arbitrary or ambiguous, and unfounded. Likewise Fisher’s reference 
sets. Then Neyman and Pearson developed “objective” (frequentist) concepts 
even as Ramsey and de Finetti were proving that “subjective” Bayesianism was 
the only coherent theory possible. Soon (well before Pratt’s 1961 and 1965 
surveys) objective methods too were found arbitrary and theoretically and 
practically deficient even in the simplest situations (where uniformly most 
powerful tests are randomized). 
But the new, true Bayesian road, though well lit, is not always smooth. Prior
distributions are hard to assess, practically impossible sometimes. “Reference”
priors, though convenient and objective (subject-independent), sometimes commit
theoretical and even practical improprieties, some of which, we admit
it, frequentism identifies. Bayesian methodology (not theory) has progressed
least in many-parameter inference. With hierarchical structures, it has done well, 
extending and clarifying the frequentist models and analysis of random effects. 
But with nonparametric structures, including ordinary sample surveys, it has 
done poorly. 

Suppose, for simplicity, that $x_1, x_2, \ldots$ are i.i.d. with continuous distribution F,
possibly multivariate, and are observed infinitely accurately. A nonparametric
Bayesian might expect his true prior and want an approximate prior to have
these properties, among others, at least for “well-behaved” samples.


(1) P(F is continuous) = 1.

(2) P($\hat F_n$ is continuous) = 1, where $\hat F_n$ is the expectation of the posterior distribution
of F, the predictive distribution of $x_{n+1}$ given $x_1, \ldots, x_n$.

(3) $\hat F_n - F_n$ is of order 1/n in a suitable sense, where $F_n$ is the empirical
distribution.

(4) A posteriori $F(x) \sim N(F_n(x), F_n(x)[1 - F_n(x)]/n)$ to order 1/n.

(5) In intervals of order 1/n, the information in the sample spacing is dominated 
by prior expectation of smoothness. 


The reason to expect (3) and (4) is that, even in finite-parameter problems, a 
positive prior density affects the posterior distribution only by order 1/n, and 
the likelihood dominates the prior in determining the whole posterior distribu- 
tion, not merely the location at which it concentrates. That is, the sample 
information dominates the prior information in intervals of order $1/n^{1/2}$.

The big question is what constitutes “well-behaved” samples. Are they more 
than a set of probability 1 under the true prior? Less? 

Diaconis-Freedman consistency demands less than (3), but demands it for 
almost all samples from every distribution F. This may be an important property 
of a Bayes rule if that rule is regarded as just another way to get a point estimate 
of the true parameter. However, a Bayesian wants not an estimate but the 
posterior distribution of the parameter given the data. Doob’s theorem says that 
the posterior will be consistent for almost all parameter values. But consistency 
everywhere is neither necessary nor sufficient for a prior to be a good representa- 
tion of prior beliefs. 

Dirichlet priors have properties (3) and (4) (by the beta posterior of F(x)), but
not (1), (2), or (5). Maximum possible independence is their beauty (they are
manageable and consistent) but also their curse: They totally ignore smoothing,
which is really the main issue, where prior information counts most. Their
unsatisfactoriness is most telling in small samples (where failure to smooth
matters most) but most provable in large samples (e.g., $P(x_n \text{ is new}) \le \|\alpha\|/
(\|\alpha\| + n - 1) \to 0$ implies P(F is discrete) = 1, which is undesirable and an
easier proof than we have seen in the literature).
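
The probability bound quoted above can be turned into a quick calculation of how few distinct values a Dirichlet-process sample contains. The sketch below assumes a nonatomic prior measure $\alpha$, so that the bound holds with equality; `alpha_mass` stands for $\|\alpha\|$ and the function names are illustrative.

```python
def prob_new(alpha_mass, n):
    """P(x_n is new) = ||alpha|| / (||alpha|| + n - 1), assuming nonatomic alpha."""
    return alpha_mass / (alpha_mass + n - 1)


def expected_distinct(alpha_mass, n):
    """Expected number of distinct values among x_1, ..., x_n."""
    return sum(prob_new(alpha_mass, j) for j in range(1, n + 1))


for n in (10, 100, 10_000):
    print(n, expected_distinct(1.0, n), expected_distinct(1.0, n) / n)
```

The expected number of distinct values grows only logarithmically in n, so the fraction of new observations tends to zero.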

Now the Dirichlet happens to be consistent. Indeed, the case $\alpha$ constant,
$\|\alpha\| = 0$ (suggested to us by Zellner), is as improper as can be yet gives the
empirical distribution, while $\alpha$ constant, $\|\alpha\| = 1$ gives Fisher’s predictive distribution
(equal probability in each of the n + 1 intervals defined by the order
statistics). But in a May 17 discussion session of the Seminar on Bayesian 
Inference in Econometrics, many found fault with the Dirichlet, no one defended 
it beyond consistency, and much progress was reported with more satisfactory 
priors. So we do not believe that bad properties of “Jeffreys-style” or symme- 
trized extensions of the Dirichlet should faze “practicing Bayesians” as Diaconis 
and Freedman imply. As far as we know, all oddities can be attributed to the 
priors, not to more fundamental difficulties in Bayesian philosophy. If they could 
not, the identification of the unidentifiable would bother us more than the 
inconsistency of the symmetrized Dirichlet, because guaranteed symmetry seems 
even more unreal to us than a priori independence of location and shape. But 
both seem dangerous as philosophical testing grounds. 

Returning to one small part of the “big question”: How much consistency 
should we expect or require? We might expect consistency for all absolutely 
continuous distributions (densities), or all lattice distributions, but we would not 
require it beyond what Doob guarantees for the true prior, especially now that 
our eyes have been opened to how much this would imply. We would no longer be 
surprised, let alone dismayed, by inconsistency for the kinds of samples one 
would get for F continuous but singular with respect to Lebesgue measure. We 
would be happy to restrict attention to a topologically small family of prior 
distributions, such as those assigning probability 1 to densities. Whether F is 
part of an objective probability model, or only in the mind of the beholder, the 
“classical”—“subjectivist” distinction of Diaconis and Freedman, seems unim- 
portant to us, and irrelevant here, and we would look for merging of opinion to 
the same point as a consequence of whatever consistency is present, and merging 
in the full sense as a consequence of (4), not vice versa. 

The results on the sensitivity of the posterior to the prior (the last part of
Section 3, and Appendix B) represent an interesting approach to the problem of
choosing a convenient prior $\hat P$ that approximates the “true” prior P in such a
way that, given the sample x, the posteriors $\hat P_x$ and $P_x$ are close (compare
Krasker (1984)). Under the conditions of Theorem 4, the norm of the derivative
of the map T from priors to posteriors is the ratio of maximum to mean
likelihood. However, T is not even continuous without the somewhat artificial
assumption that $f(x \mid \theta)$ is bounded in $\theta$. In addition, the results about the
derivative $\dot T_P$ use the total-variation norm on both the priors and posteriors. This
is an overly strong topology for the space of priors if the parameter indexes the
set of continuous distributions on R, since the computationally feasible methods
of approximating the true prior (say by a finite-dimensional parametric model,
or even an extended Dirichlet process) assign probability 1 to a set that has true
probability 0. As Diaconis and Freedman point out at the end of Appendix B, the
results can be extended to the weak-star topology, say using the Prohorov metric
d. (This requires a metric on $\Theta$; the natural way to provide one is to identify each
$\theta$ with the distribution it indexes, and use either the Prohorov or total-variation
metric.) Frechet differentiability in this context requires the further assumption
that $f(x \mid \theta)$ satisfy a Lipschitz condition in $\theta$. (Continuity does not appear to be
enough to give the necessary property that $d(f\,dP, f\,dQ) = O(d(P, Q))$.) This
Lipschitz condition in $\theta$ (or even continuity, for that matter) is an additional
severe restriction on the set of densities $\{f(\cdot \mid \theta)\}$, requiring for example that they
satisfy a Lipschitz condition in x, uniformly in $\theta$. The norm of the Frechet
derivative (the maximum of the ratio of change in the posterior to change in the
prior, as the latter goes to zero) agrees with the formula $\|\dot T_P\| = f(x \mid \hat\theta_{ML}) / \int f\,dP$
derived using the total variation norm provided $P(N_\varepsilon(\theta)) = o(\varepsilon)$, where $N_\varepsilon(\theta)$ is
the $\varepsilon$-neighborhood of $\theta$ in $\Theta$. (This condition should hold if $\Theta$ is more than
one-dimensional.) This ostensibly shows, in situations in which the assumptions
for differentiability hold, that in order to ensure $d(\hat P_x, P_x) < \varepsilon$, we should select
$\hat P$ satisfying $d(\hat P, P) < \varepsilon / \|\dot T_P\|$. However, it is easy to show that $\|\dot T_P\| \to \infty$ as
the sample size goes to infinity. This says in particular that in large samples the
condition on $\hat P$ will be virtually impossible to satisfy, and says more generally
that, contrary to intuition, it is in large samples that the posterior is most
sensitive to the prior. We can get further insight into the local behavior of T by
examining the second derivative $\ddot T_P$, which can be regarded as a symmetric
bilinear map priors × priors → posteriors, and which will exist under the assumptions
used to obtain $\dot T_P$. We find that


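$\ddot T_P(H, G) = -\left[ \int f\,dG \Big/ \left( \int f\,dP \right)^2 \right] f\,dH - \left[ \int f\,dH \Big/ \left( \int f\,dP \right)^2 \right] f\,dG$
$\qquad + 2 \left[ \int f\,dH \int f\,dG \Big/ \left( \int f\,dP \right)^3 \right] f\,dP.$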


In particular, in the second-order expansion $T(P + H) - T(P) \approx \dot T_P(H) +
\tfrac12 \ddot T_P(H, H)$, the second-order term can be important unless $\|\dot T_P\|\,\|H\| \ll 1$. If
$\|\dot T_P\|$ is large, as it will be when the sample is large, the first derivative will yield a
good approximation to T only too close to P to be of use.
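
On a finite parameter set the map T and its derivatives reduce to elementary arithmetic, which allows a numerical check of the expansion. The Python sketch below is an illustration under that finite-$\Theta$ assumption; the likelihood values, prior, and perturbation are arbitrary, and the derivative formulas are those obtained by differentiating $T(P) = f\,dP / \int f\,dP$ directly.

```python
def mean_lik(f, P):
    """Integral of f dP for a prior P on a finite parameter set."""
    return sum(fi * pi for fi, pi in zip(f, P))


def posterior(f, P):
    """T(P): posterior weights proportional to f * P."""
    m = mean_lik(f, P)
    return [fi * pi / m for fi, pi in zip(f, P)]


def first_derivative(f, P, H):
    """First derivative of T at P in direction H (a signed measure)."""
    m, h = mean_lik(f, P), mean_lik(f, H)
    return [fi * hi / m - h * fi * pi / m ** 2 for fi, pi, hi in zip(f, P, H)]


def second_derivative(f, P, H, G):
    """Second derivative of T at P, as a symmetric bilinear map in (H, G)."""
    m, h, g = mean_lik(f, P), mean_lik(f, H), mean_lik(f, G)
    return [-(g * fi * hi + h * fi * gi) / m ** 2 + 2 * h * g * fi * pi / m ** 3
            for fi, pi, hi, gi in zip(f, P, H, G)]


def derivative_norm(f, P):
    """Ratio of maximum to mean likelihood."""
    return max(f) / mean_lik(f, P)


def bernoulli_ratio(n, thetas, prior):
    """Max-to-mean likelihood ratio for n Bernoulli trials with n // 2 successes."""
    s = n // 2
    lik = [t ** s * (1 - t) ** (n - s) for t in thetas]
    return derivative_norm(lik, prior)


thetas = [0.1 * i for i in range(1, 10)]
prior = [1.0 / 9.0] * 9
print(bernoulli_ratio(10, thetas, prior), bernoulli_ratio(100, thetas, prior))
```

In this toy case the ratio of maximum to mean likelihood increases with the sample size, which is the effect described in the text.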

What, in jargon natural at our institution, is the bottom line? As far as we can 
see, it is that satisfactory prior distributions for nonparametric problems are still 
unavailable and that it is naive to expect too much in certain directions. This 
completes our discussion and Bayesian defense against frequentist analysis. If we 
have referred to ourselves unseemly often, it may signify that the foundations of 
statistics are personal. If we have seemed unseemly to the authors, be assured 
that we would have much pleasure in seconding a vote of thanks to them. 
Our distinguished colleagues deserve congratulations for contributing yet
another important study on the behavior of Bayes estimates. Looking over the
main thrust of their essay, I find it ironic that two self-confessed former
Bayesians have spent so much ingenuity showing that Bayes estimates can
behave very badly, while the present writer, a staunch former and present
anti-Bayesian, made efforts to emphasize the good properties of Bayes estimates.
One could perhaps summarize the situation as follows: Take a family $\{P_\theta: \theta \in \Theta\}$
of probability measures $P_\theta$ on a Polish space $\mathcal{X}$ and suppose that $\Theta$ itself is either
Polish or at least Borelian in a Polish space. Then, according to Wald and others,
for any decision problem, Bayes and approximate Bayes procedures form complete
classes. If $\mu$ is a positive finite measure on $\Theta$ and if $\theta \mapsto P_\theta(A)$ is
measurable for each Borel $A \subset \mathcal{X}$, one can form a marginal measure $\mu \cdot P$ and a
joint “semidirect product” measure $\mu \otimes P$ by $(\mu \otimes P)(B \times A) = \int_B P_\theta(A)\,\mu(d\theta)$.
If one takes seriously the principle, call it Principle II, that sets of very small
$\mu \otimes P$ probability are practically negligible, then Bayes procedures for $\mu \otimes P$ are
good. If, however, one induces the distributions on $\mathcal{X}$ through some other
measure, say Q, Bayes procedures can behave in a most unpredictable fashion.
This is so, as shown by our colleagues, even if $\mu$ is itself a direct product of two
terms (Dirichlet × Gaussian) that, separately, lead to excellent behavior.

Under Principle II one obtains theorems such as Doob’s theorem of 1949 and a
variety of other results. For instance, in the i.i.d. case, and many other ones,
anything that is asymptotically Bayes for a prior measure $\mu$ is also asymptotically
Bayes for any $\nu \ge 0$ dominated by $\mu$. In the most general case, with all the
items in sight depending on some n that tends to infinity, suppose $\Theta$ metrized by
a distance d and look at balls B(t, r) of center t and radius r depending on x.
Select, among balls whose posterior probability is > 1/2, one that has almost the
smallest possible radius. Let $\hat\theta_n$ be its center. Then if for the joint measures $\mu \otimes P$
there are estimates $T_n$ that converge at a rate $\delta_n$ (in the sense that for $\varepsilon > 0$ there
is a $b < \infty$ such that $(\mu \otimes P)[d(T_n, \theta) \ge b\delta_n] < \varepsilon$ for n large), then $\hat\theta_n$ enjoys the
same properties. The “tails” $(\mu \otimes P)[d(\hat\theta_n, \theta) \ge b\delta_n]$ also tend to zero at the best
possible rate.

There are many more properties of this general nature. Unfortunately, they
give little information about what happens for observations X generated from a
probability measure Q, unless it happens that Q is close to an average $P_V =
\int_V P_\theta\,\mu(d\theta)/\mu(V)$ for sets V whose $\mu$ measure is not too small, or, if there is an n
involved, for sets such that $\mu(V)$ does not tend to zero too rapidly.

In a paper (Le Cam, 1982) cited by Diaconis and Freedman, the present writer
attempted to obtain bounds on the maximum risk of Bayes estimates in a
situation describable as follows: One has independent observations $X_j$, $j =
1, \ldots, n$, where $X_j$ has distribution $p_{\theta,j}$, $\theta \in \Theta$. One introduces a distance H by
$H^2(s, t) = \tfrac12 \sum_j \int (\sqrt{dp_{s,j}} - \sqrt{dp_{t,j}})^2$. Then, letting $D(\tau)$ be the metric dimension of
the space $\{\Theta, H\}$ at the level $\tau$, one can show that there exist estimates $T_n$ such
that $E_\theta H^2(T_n, \theta) \le C D(a)$ where a is a number such that (for D(a) large) one
has approximately $D(a)/a^2 \sim 324$ and where C is a universal constant (< 10 for
large D(a)).

One would then expect that, for a loss function $H^2(z, \theta)$, and for prior
measures $\mu$ that are sufficiently well spread out, the Bayes estimates $\beta_n$ would
satisfy a similar inequality: $E_\theta H^2(\beta_n, \theta) \le C' D(a)$. This is indeed the case.
However, we could not find measures $\mu$ that are sufficiently well spread out
except under a severe growth restriction on $D(\tau)$ as $\tau \to 0$. Roughly, the growth
restriction is that $D(\tau)$ increases slower than $\tau^{-1}$ as $\tau \to 0$. This rules out
interesting cases, such as the case where $\Theta$ is the set of bounded densities
satisfying a Lipschitz condition on the unit square of the plane. The nonparametric
sets used by Diaconis and Freedman have dimensions that increase very
rapidly as $\tau \to 0$, even if the distances used are much weaker than our H. Most
small open sets have positive but essentially negligible probabilities.

To obtain better results, it seems necessary to take into account features of the 
statistical problem that are not summarized by the distance H. Which features 
are most important is presently a matter of conjecture. Here, Diaconis and 
Freedman suggest a direction of study that may be very important: They 
investigate the derivative of the posterior measure viewed as a function of the 
prior measure. Now, let $\mu \cdot P$ be the marginal measure $\int P_\theta\,\mu(d\theta)$, let $\mu \otimes P$ be
the joint distribution, and let $K_x$ be the conditional distribution of $\theta$ given x.
Then, with the present symbolism


$(\mu \cdot P) \otimes K(\mu, P) = \mu \otimes P.$


This relation can be differentiated not only in $\mu$ but also in P. For instance,
retaining only first order terms in $\varepsilon$, one would have


$(\mu \cdot P) \otimes \{K(\mu, P + \varepsilon\Delta) - K(\mu, P)\} \approx \varepsilon\{\mu \otimes \Delta - (\mu \cdot \Delta) \otimes K(\mu, P)\},$


a relation analogous to the one given by Diaconis and Freedman. It may be
feasible from such relations to find out which features of $\mu$ or $\{P_\theta: \theta \in \Theta\}$
influence the posterior distributions and the attached risks. However, as far as we 
know the subject has not yet been studied in sufficient detail. 
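
On finite spaces the relation between marginal, conditional, and joint measures, and its first-order perturbation in P, can be verified directly. The following Python sketch is only an illustration under that finiteness assumption; the measures `mu`, `P`, and the perturbation `Delta` (rows summing to zero) are made-up numbers.

```python
def marginal(mu, P):
    """(mu . P)(x) = sum over theta of mu(theta) * P_theta(x)."""
    return [sum(mu[t] * P[t][x] for t in range(len(mu)))
            for x in range(len(P[0]))]


def conditional(mu, P):
    """K[x][t]: conditional weight of theta = t given observation x."""
    m = marginal(mu, P)
    return [[mu[t] * P[t][x] / m[x] for t in range(len(mu))]
            for x in range(len(m))]


def first_order_gap(mu, P, Delta, eps):
    """Largest discrepancy between the two sides of the first-order relation."""
    P_eps = [[P[t][x] + eps * Delta[t][x] for x in range(len(P[0]))]
             for t in range(len(mu))]
    m, d = marginal(mu, P), marginal(mu, Delta)
    K, K_eps = conditional(mu, P), conditional(mu, P_eps)
    gap = 0.0
    for x in range(len(m)):
        for t in range(len(mu)):
            lhs = m[x] * (K_eps[x][t] - K[x][t])
            rhs = eps * (mu[t] * Delta[t][x] - d[x] * K[x][t])
            gap = max(gap, abs(lhs - rhs))
    return gap


mu = [0.4, 0.6]
P = [[0.2, 0.3, 0.5], [0.5, 0.4, 0.1]]
Delta = [[0.1, -0.1, 0.0], [-0.2, 0.1, 0.1]]
print(first_order_gap(mu, P, Delta, 1e-4))  # of order eps squared
```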

Perhaps my formerly Bayesian colleagues will tell us in the near future what 
pairs (u, P) are “safe” and what pairs are bound to give trouble. 
DENNIS V. LINDLEY

My own view of statistics is that it is a way of studying some aspects of the
real world, namely the uncertainty present in any study, and of expressing my 
beliefs about the world. The subject is not primarily mathematical but mathe- 
matics plays an essential role because it enables me to pursue the logical 
consequences of beliefs and see whether they conform with other beliefs. It is
perfectly possible for the logic alone to change my beliefs. For example, at the
moment of writing my belief that the 24th digit in the decimal expansion of $\pi$ is
4 is expressed by a probability of 0.1. Were I to do the mathematics, or accept the
mathematics that others have done and consult a book, I would find that the 
digit is 3 and my probability is now 0 (or at least very small: The book might be 
in error or I could have erred in reading it). The most impressive example of logic 
changing beliefs is the work of Ramsey, Jeffreys, Savage, and de Finetti in 
demonstrating that beliefs need to be measured probabilistically and not, for 
example, by significance levels. 

Now in the present paper (referred to as DF) we have some impeccable logic 
that shows that in certain circumstances the Bayes estimate will be inconsistent. 
Just as the book changed my opinion of $\pi$, so DF changes my beliefs about the
estimation of a location parameter. Before discussing this let us clarify two 
points. In applying mathematics to the real-world problems of statistics it is 
always necessary to be reasonably sure that the mathematical modelling has been 
sensibly done. (Some might object to Ramsey’s work on these grounds.) The 
modelling in DF does seem reasonable to me and the results cannot be dismissed 
on these grounds. Secondly, I do object to the use of Bayes estimates. These are 
just a carryover from the inept modelling of sampling-theory statistics. The 
Bayes “estimate” of $\theta$ is the probability distribution of $\theta$ given the data. As far
as I can see this does not affect the conclusions of DF since their theorems relate
to $\mu_n$ and not $\hat\theta$ (Equation (1.1)).

So what am I to make of the mathematical results of DF? Clearly they change 
my beliefs in some way, but how? One thing I could do is to change my prior 
beliefs and not use a Dirichlet with Cauchy measure. Jeffreys (1967) does 
something like this in a different context. In Section 5.2 he notes that a normal 
prior would lead to posterior views that are unacceptable to him: so he uses a 
Cauchy form and all is well. In DF I could replace the Cauchy by a normal. But 
it may be that the Cauchy form does adequately reflect my opinions so that the 
inconsistency persists. Now the result of DF tells me to beware of $\mu_n(\theta)$, at least
in certain cases. But presumably if A really had the eccentric trimodal form of 
Figure 1, the empirical distribution function would reveal this as a serious 
possibility. Looking at that function I would loosely argue something like this. 
DF warned me about these trimodal fellows and yes, $\mu_n(\theta)$ does keep oscillating
between the left- and right-hand values, so I had better change my view and 
think that the location is at the central mode. 

It is not clear to me how logical results should change my beliefs. Bayes 
showed us how to change with data but is there some sensible way to react to 
mathematics? (With $\pi$ it was easy.) Here is a very simple example of the
problem. I am considering two events A and B and after reflection assign 
probabilities p(A) and p(B), perhaps 0.6 and 0.5. Now DF comes along and 
demonstrates that A and B are exclusive, a fact I had not known. What are 
reasonable values of p(A) and p(B) now? One way suggested in Lindley et al.
(1972) is to think of p(A) and p(B) as, in some sense, assessments, subject to
error of “true” values $\pi(A)$ and $\pi(B)$. The observation of DF amounts to saying
$\pi(A) + \pi(B) \le 1$ and the space of their values can be restricted accordingly. But
even I am not quite happy with this. The problem arises in sampling-theory
statistics. Lindsay (1980) suggests estimating a binomial parameter, p say, by
r/n, in the standard notation. But a little calculation in the problem shows that
$p = \tfrac12 - 2\theta^2$: What is the estimate now (especially when $r/n > \tfrac12$)?

There is a further aspect to the results of DF: They point out that $\theta$ can be
estimated by using the median. They say “Bayes estimates do worse than 
available frequentist procedures.” Is this justified? There are two possibilities: 
either the median is a Bayes estimate or it is not. (By a Bayes estimate here I 
mean for a fixed prior for all sample sizes. Some frequentist procedures, like 
significance tests, are Bayes for each n but the prior to make them Bayes has to 
change with n. The Bayesians of DF are not allowed this luxury.) If it is a Bayes 
estimate the quoted claim is false. If not, then what is the frequentist doing using 
an inadmissible procedure? Could not the coherent Bayesian make the median 
user lose money for sure? 

As a paper about statistics—and although there is understandably no hint of a 
real-world usage, I take it we are discussing it as a statistical paper—it does not 
lessen my respect for the Bayesian argument but it does reinforce doubts about 
how a Bayesian should react to logical deductions, as distinct from data. The 
logic of DF therefore has important, and to me, unresolved consequences. 

I conclude with a few miscellaneous remarks. 

(a) The Dirichlet prior is unacceptable to me because it fails to incorporate 
the positive correlation that I feel between adjacent, nonoverlapping intervals. As 
a result the posterior is insufficiently smooth. Do the inconsistency results persist 
with some smoothing present? 

(b) There is a strong reason for Bayesians being interested in frequentist 
results because the latter are useful in experimental design (preposterior analysis). 
Before the data are to hand they are random and accordingly governed by 
probability laws; the likelihood principle does not obtain and the sample space is 
relevant. 

(c) It is easy to produce examples of Bayesians (and others) being misled. Let
$X_i$ be i.i.d. $N(\theta, 1)$ and let sampling continue until the hypothesis that $\theta = 0$ is
rejected at the two-sided 5% level; that is, until $|\bar X| > 2\sigma/\sqrt{n}$, $\bar X$ being the mean
of a sample of size n. This is certain to happen. Let $\theta$ have a uniform prior. Then
$p(\theta < 0 \mid \bar X, n)$ must always be less than 2½% or greater than 97½%, since $\theta$ is
$N(\bar X, \sigma^2/n)$.
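
Remark (c) can be simulated. The Python sketch below samples from $N(\theta, 1)$ until $|\bar X| > 2/\sqrt{n}$ and then reports the posterior probability of $\theta < 0$ under a uniform prior. The cap `max_n`, the seed, and the function name are assumptions added so that the sketch terminates; the stopping rule in the text has no such cap.

```python
import random
from statistics import NormalDist


def sample_until_rejection(theta=0.0, max_n=100_000, seed=0):
    """Sample N(theta, 1) until |xbar| > 2 / sqrt(n); return (n, xbar, P(theta < 0 | data)).

    Returns None if the rule has not fired within max_n draws (a cap assumed here).
    """
    rng = random.Random(seed)
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(theta, 1.0)
        xbar = total / n
        if abs(xbar) > 2.0 / n ** 0.5:
            # under a uniform prior, theta given the data is N(xbar, 1/n)
            prob_negative = NormalDist(xbar, 1.0 / n ** 0.5).cdf(0.0)
            return n, xbar, prob_negative
    return None


print(sample_until_rejection(theta=0.0))
```

Whenever the rule fires, the reported posterior probability is necessarily below 2.5% or above 97.5%, whatever the true $\theta$.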

(d) An earlier version of this paper was given by Freedman at the 1983 IMS 
meeting in Toronto. I did not attend the meeting but for months afterwards I 
had people coming up to me with undisguised glee telling me about the paper and 
implying that Bayesianism was now dead. May I remind any who think this that 
all frequentist procedures have counterexamples far simpler and far more deva- 
stating than any this paper contains. I do not remember these being discussed in 
the IMS journals: How about it, editor? I collected a few together in Lindley 
(1972). 

(e) My first, quick reaction to this paper was to dismiss it as modern mathe- 
matics out of control again. This is grossly unfair. The authors model commonly 
occurring situations in apparently sensible ways and produce unexpected results. 
It has given me much to think about and will continue to do so after the deadline 
for submission of this comment has passed. Diaconis and Freedman have done us
a service in exploring the consequences of apparently innocuous assumptions so 
carefully. 
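
Remark (c) above can be checked numerically. The following is a minimal sketch, not part of the original comment: the function names are hypothetical and σ = 1 is assumed. On the stopping boundary $|\bar X| = 2\sigma/\sqrt{n}$, the posterior probability that θ < 0 is Φ(−2) or Φ(2), whatever the value of n.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_theta_negative(xbar: float, n: int, sigma: float = 1.0) -> float:
    """P(theta < 0 | xbar, n) when theta given the data is N(xbar, sigma^2 / n)."""
    return normal_cdf(-xbar * math.sqrt(n) / sigma)


# Evaluate exactly on the stopping boundary xbar = +/- 2 * sigma / sqrt(n).
for n in (1, 10, 1000):
    boundary = 2.0 / math.sqrt(n)
    print(n, prob_theta_negative(boundary, n), prob_theta_negative(-boundary, n))
```

Beyond the boundary the probabilities are more extreme still, which is the content of the remark.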

Introduction. We would like to thank the discussants for their careful work.
For context, we summarize our position. 

(a) As a team, our motives are mixed to an unusual degree, because we differ 
on many issues in foundations, including the interpretation of some of our results.
However, we are unanimous that the mathematics in our paper should be of 
interest to Bayesians, ex-Bayesians, and never-Bayesians alike. 

(b) Frequentists can use the Bayesian approach, like maximum likelihood or 
optimality, as a powerful heuristic engine for generating statistical procedures. 
No such engine is foolproof, so you should always look to see how well the 
procedure is going to do. Even the crustiest subjectivist ought to follow this 
advice, when the prior is only an approximation (and possibly quite a crude one, 
chosen for computational convenience) to the true subjective belief. Besides its 
practical importance, checking operating characteristics is good, clean
mathematical fun.

(c) Pitfalls in the classical approach are well known; those in the Bayesian 
approach perhaps less so. We have given some examples where plausible applica- 
tions of Bayesian technique lead to disaster. It is particularly easy to lose your 
way in high dimensional parameter space. 

(d) We view consistency as a useful diagnostic test. If your procedure gives the 
wrong answer with unlimited data, probably you will not like it so well with a 
finite sample either. 

(e) We show how putting conditions on the underlying model and modifying 
the prior can sometimes rescue Bayes procedures. As a general heuristic device 
for developing such patches, we propose a kind of Bayesian sensitivity analysis,
the “what if” method. 

To our dismay, the negative reactions came from Krasker—Pratt, and we 
answer them first. 


Krasker-Pratt. Krasker and Pratt accuse us of false cheer with respect to
classical procedures. For one of us (who is seldom accused of cheer with respect to 
anything) the change of pace is refreshing. They cite hierarchical regression 
models as a Bayesian success story, despite substantial evidence to the contrary 
(Freedman and Navidi, 1985). And they carefully avoid committing themselves 
on the question of whether consistency matters. As a final point of rhetoric, they 
take us to task for bench-testing a Bayesian solution to the simplest textbook 
problem of them all: estimating a location parameter with symmetric errors. Our 
view is conservative: If your procedure runs into trouble on that problem, you 
may have worse troubles on other problems. 

Their main point, however, is that “all oddities can be attributed to the priors, 
not to more fundamental difficulties in Bayesian philosophy.” Indeed, they 
continue, “in a discussion session of the Seminar on Bayesian Inference in 
Econometrics, many found fault with the Dirichlet, no one defended it beyond 
consistency, and much progress was reported with more satisfactory priors.” 

We do not know of any substantially different priors for use in nonparametric 
situations, meeting the minimum tests of consistency and computability. That is 
why the Dirichlet and its analogs were introduced; that is why people use them; 
and that is why we pitched our examples in those terms. If Krasker and Pratt
know better, they should say explicitly what priors they like. Then somebody can 
do the asymptotics. 

The main objection to the Dirichlet seems to be that it assigns full probability 
to the discrete distributions: The Dirichlet priors “totally ignore smoothing, 
which is really the main issue, where prior information counts the most.” Well, 
Krasker-Pratt, lots of luck. The Dirichlet works on a divide-and-conquer strategy.
You can divide the line up into a finite number of pieces, and forget what goes on 
inside each piece: This reduces an infinite-dimensional problem to a finite one. 
Then, you can repeat the process inside each piece. 

Smoothing forces data in different parts of the line to interact, and this creates 
a whole new level of technical complexity. Indeed, this interaction across inter- 
vals is probably what differentiates the location problem from the problem of 
simply estimating an unknown distribution function, where the Dirichlet per- 
forms very well indeed—despite its marked preference for discrete distributions. 

On the whole, the discreteness issue seems to us to be a red herring. Consider a 
Dirichlet with a normal rather than Cauchy base probability: This prior still 
concentrates on discrete distributions but the posterior is consistent. The prob- 
lem is caused by multimodal densities and a base probability which is not log 
convex, rather than by discreteness. 

At a more speculative level, we think there are counterexamples involving 
priors which have full support and concentrate on smooth distributions. How- 
ever, detailed calculations are difficult, and we have not done them. The first idea 
is to smooth the Dirichlet, as proposed by Lo. Fix a kernel density k, choose F at
random from the Dirichlet, and look at the convolution $k * F$. We will use
$k * D(\alpha)$ to denote the law of $k * F$ when F has the law $D(\alpha)$. Such priors
are necessarily inconsistent, because they do not have full support: Indeed,
$\max k * F \le \max k$.

The next idea is to consider mixtures. Fix a sequence $k_n$ of kernel densities
tending to point mass at 0, for example, normal with mean 0 and variance 1/n.
Consider nonnegative weights $w_n$ adding to 1. Look at

\[ \sum_n w_n\, k_n * D(\alpha_n). \]

This prior has full support, but concentrates on smooth densities. We think it is
consistent for some choices of $w_n$ and $\alpha_n$, and inconsistent for others (if $w_n$ tends
to 0 sufficiently rapidly, and the mass of $\alpha_n$ tends to infinity). Furthermore, the
prior may be consistent for estimating an unknown distribution function, but 
inconsistent in the location problem, where the unknown distribution function is 
just a nuisance parameter. 

Here is another construction, which starts from the Dirichlet but forces the 
random distribution function to be absolutely continuous; compare Kraft’s (1964) 
modification of Fabius (1964). 


(i) Use a Dirichlet on the integers to distribute mass to the intervals [n, n + 1),
but not within.

(ii) Within [n, n + 1), use a beta to randomly split the mass between the left
half and right half; the beta may depend on n.

(iii) Keep on going, with different betas at different stages. Make the variances of 
the betas decay to zero very rapidly, so the random distribution function is 
nearly equal to its expectation; this can be, for example, either normal or 
Cauchy. 


If the decay of the variances is rapid enough, almost all the random distribu- 
tions will be smooth (absolutely continuous, maybe differentiable except at 
binary rationals). This prior will be consistent for estimating an unknown 
distribution function on the line, by the divide-and-conquer argument. Now 
consider the location problem; we guess this prior is consistent when its expecta- 
tion is the normal, and inconsistent with the Cauchy. 

The real mathematical issue, it seems to us, is to find computable Bayes 
procedures and figure out when they are consistent and when they are incon- 
sistent. We wish Krasker and Pratt would use their considerable talents to help 
solve the problem, instead of burying it deeper in a pile of rhetoric. 


Hartigan. Is Hartigan part of the problem or part of the solution? He seems 
to reject the idea that dimensionality of the parameter space matters. He is being 
uncharacteristically disingenuous, when he reproduces Freedman’s (1963) original 
counterexample using a countable set of parameters. The only sensible way to 
think about those parameters is as a sequence in an infinite-dimensional 
space—the way it was set up originally. The question about entropy neighbor- 
hoods is answered (positively) in that same paper. For our part, we confess to not 
giving any careful, formal treatment of dimensionality. This leaves a clear field
for Hartigan. 


Berger. Berger asks, “How likely is it for one to encounter a consistency 
problem in practice?” In nonparametric problems, we think that inconsistency 
will be the rule not the exception, unless great care is taken in specifying the 
prior. Even in high-dimensional problems, details of the prior can have substantial
and unanticipated effects on the behavior of the procedure.

Berger and others sometimes suggest that subjective judgements can and 
should be quantified as probability distributions. This is a cornerstone article of 
faith for some Bayesians. In certain problems, this kind of quantification is surely 
possible and helpful. However, in other problems, it may not be. Indeed, the 
attempt to develop a full-blown subjective probability distribution may be 
counterproductive, while informal use of intuition could help. 

Berger (and Krasker—Pratt too) wonder whether inconsistency on a null set 
matters. We think it can. Man may be the measure of all things, but then one 
man’s null set can be another’s support. Freedman (1965) used category because 
it is neutral between measures, and because sets large in the sense of category are 
large for “most” measures. 

Finally, Berger asks for more details on our claim that estimates of the form 
proposed by Box—Tiao, Fraser, and Johns can be inconsistent. The proposed 
estimators are all of the form 


\[
\hat\theta = \frac{\int \theta \prod_i f[(X_i - \theta)/\sigma; \lambda]\, \nu(d\theta, d\sigma, d\lambda)}
{\int \prod_i f[(X_i - \theta)/\sigma; \lambda]\, \nu(d\theta, d\sigma, d\lambda)},
\]

where $f(x; \lambda)$ is a family of densities and ν is a formal prior. For example, Fraser
takes the family of all t-densities: $\mathrm{const}\,(1 + x^2)^{-\lambda}$.

These estimates are Bayes rules based on non-log-convex densities. We have
checked that for some choices of prior ν, the rules are inconsistent. We have not
shown that the rules actually suggested are inconsistent, but we believe them to
be.


Derivatives. Le Cam, Krasker—Pratt, and Clayton all make useful com- 
ments about derivatives. As they indicate, it is possible to calculate higher-order 
and even mixed partial derivatives with respect to the prior, the model, and the 
loss function. Clayton asks for a derivative of the predictive distribution with 
respect to the prior. We find this easiest to think about in the context of 
exchangeable processes. There, the predictive distribution is a linear function of
the posterior, so the calculus is straightforward: If $P_\theta^\infty$ is the product measure at
parameter θ, μ is the prior on θ, and $\dot\mu_x(d\theta)$ is the derivative of the posterior
given x, the derivative of the predictive distribution with respect to the prior is
the signed measure

\[ \int P_\theta^\infty\, \dot\mu_x(d\theta). \]

Envoi. We will not comment further on the many other interesting points
raised. We hope discussants not singled out for reply will be relieved rather than 
insulted, on the theory that no news is good news. Finally, we warmly thank the 
Editor, Associate Editor, and discussants for their encouragement and support. 

ON INCONSISTENT BAYES ESTIMATES OF LOCATION


By P. DIACONIS AND D. FREEDMAN
Stanford University and University of California, Berkeley 


In some relatively natural settings, Bayes estimates of location are shown 
to be inconsistent. 


1. Introduction. Consider the problem of estimating a location parameter θ
observed subject to errors $\varepsilon_i$. The data are modelled as


(1.1) $X_i = \theta + \varepsilon_i$,


where the $\varepsilon_i$ are independent with unknown distribution function F. Bayes
estimates are computed from a prior distribution for θ and F. One natural choice
is to take θ and F independent, F having a Dirichlet distribution with parameter
measure α. Ferguson (1974) contains a review of Dirichlet priors. The
posterior distribution for such a prior will be given in Lemma 2.1.

With squared error as loss, the Bayes estimate of θ is the mean of the
posterior distribution. One of our principal results is that for some prior distributions,
the Bayes estimate is inconsistent: There are F’s with a density symmetric
about zero such that the Bayes estimate for θ oscillates between two nonzero
numbers as data accumulate.

To be specific, suppose that the prior density f for θ is standard normal, while
the parameter measure α for the Dirichlet is Cauchy, having density
$\alpha' = g(x) = 1/[\pi(1 + x^2)]$. Let $\pi_n = \pi_n(X_1, \ldots, X_n)$ be the posterior distribution of θ and F
given the data $X_1, \ldots, X_n$. We will construct a $C_\infty$ density h with compact
support, symmetric about 0 and with a strict maximum at 0, such that if θ = 0
and the ε’s are independent draws from h, the posterior distribution of θ
oscillates between two false values ±γ, and is therefore inconsistent. Here, γ is a
positive number depending on h. If desired, h can be chosen strictly positive on
the interior of its interval of support.


THEOREM 1. Let $X_i$ follow the model (1.1), where θ = 0 and the $\varepsilon_i$ have a
compactly supported $C_\infty$ density h, which is symmetric about 0, with a strict
maximum at 0. For the prior, θ has the standard normal density, and F is
independently drawn from the Dirichlet based on the standard Cauchy. For some
h: as n → ∞, almost surely, the posterior $\pi_n$ given $X_1, \ldots, X_n$ concentrates near
±γ, where γ is a positive number depending on h. For each large n, there is
probability near ½ that $\pi_n$ concentrates close to γ, and probability near ½ that $\pi_n$
concentrates close to −γ. Moreover, for any η > 0,

\[ \limsup \pi_n\{(\theta, F): |\theta - \gamma| < \eta\} = 1 \quad \text{a.e.}, \]
\[ \limsup \pi_n\{(\theta, F): |\theta + \gamma| < \eta\} = 1 \quad \text{a.e.} \]


Of course under the conditions of the theorem, the median is a consistent 
estimate of θ, as is any trimmed mean; so the Bayes estimates are worse than
available objectivist procedures. Further motivation and philosophical discussion 
is in Diaconis and Freedman (1986). The proof of the theorem is deferred to 
Section 2. Estimates of the form considered in the theorem have actually been 
suggested in the Bayesian literature; see Dalal (1979a, 1979b, 1980). Dempster 
(1976) gives an extensive survey of the Bayesian approach to robust estimation, 
which is linked to estimates of the form considered in the theorem. Similar 
estimates have been suggested by Fraser (1976) and Johns (1979) from the 
frequentist viewpoint. The argument for Theorem 1 shows that for some underly- 
ing distributions these estimators are inconsistent. 

The theorem remains valid if the normal prior density for θ is replaced by any
smooth, everywhere positive density. Nor is the choice of Cauchy for α crucial.
Any t-density works as well. However, if α has a density α′ and log α′ is convex,
then the posterior converges to point mass at θ, for essentially any choice of h.
This may be shown using the arguments of Freedman and Diaconis (1982b).

The density h constructed for Theorem 1 has a single global maximum at 0;
but h is not strongly unimodal: it has three local maxima (see Figure 1 in
Section 2). Using the arguments of Freedman and Diaconis (1982b), it can be
shown that if h is continuous, strongly unimodal, and symmetric about θ, the
posterior converges to point mass at θ almost surely. (A density is strongly
unimodal if it increases to its unique maximum and then decreases.)

In Diaconis and Freedman (1983b) we have an argument proving that any
location mixture of Dirichlets gives a consistent estimate of the sampling distribution.
So the marginal posterior distribution of the sampling distribution
converges to point mass at $\mathcal{H}$, the distribution function with density h. This has
a peculiar implication: Indeed, when $\pi_n$ concentrates on θ’s near γ, then $\pi_n$ must
concentrate on F’s which are near $\mathcal{H}$ shifted to the left by γ; when $\pi_n$
concentrates on θ’s near −γ, then $\pi_n$ must concentrate on F’s which are near
$\mathcal{H}$ shifted to the right by γ. Thus, $\pi_n$ gets both θ and F badly wrong, but it gets
the law of the data, namely F shifted by θ, nearly right.

One of the issues in this example is the identifiability of the parameters. In
general, of course, the convolution $\delta_\theta * F$ cannot be decomposed into its components
θ and F; here, $\delta_x$ is point mass at x. However, from the point of view of a
Bayesian with the prior in the theorem, the two parameters θ and F are
identifiable, as the next result shows.


THEOREM 2. In the setting of Theorem 1, there is a measurable function φ
such that $\varphi(\delta_\theta * F) = \theta$ for almost all θ and F.


In particular, a Bayesian will be convinced that his Bayes rule is consistent at
almost all pairs (θ, F). It is therefore of mathematical interest, and we think also
of interest from the point of view of the foundations of inference, to ask about
consistency at particular (θ, F), especially for symmetric F where θ is objectively
identifiable. That is why we present Theorem 1. We give an extensive
discussion of the relation between consistency and Bayesian inference in Diaconis
and Freedman (1986).

As noted above, one standard way to make θ identifiable is to require that the
ε’s be symmetric. Then, it seems reasonable to symmetrize the distribution
function chosen from the Dirichlet. If G is the law of X, let $G^-$ be the law of
−X, and let $\bar G = \tfrac12(G + G^-)$. So $\bar G$ is symmetric. Let $\bar D_\alpha$ be the law of $F = \bar G$,
where G has law $D_\alpha$; this is a “symmetrized” Dirichlet.

For the next theorem, let θ have a normal prior density, and let F be
independent of θ, having the prior distribution $\bar D_\alpha$, where α is Cauchy. Let $\tilde\pi_n$ be
the posterior distribution for θ: This will be computed in Lemma 3.1. Again, as
Theorem 3 below demonstrates, $\tilde\pi_n$ can oscillate between two false values ±γ for
θ. This time, the inconsistency spreads to the posterior opinion of the sampling
distribution. Thus, a location mixture of symmetrized Dirichlets can be inconsistent
for the sampling distribution, even when the latter is symmetric. So, a
straightforward way of putting a prior on symmetric ε’s does not cure the
inconsistency of the Bayes procedures.


THEOREM 3. Let $X_i$ follow the model (1.1), where θ = 0 and the $\varepsilon_i$ have a
compactly supported $C_\infty$ density h, which is symmetric about 0, with a strict
maximum at 0. For the prior, θ has the standard normal density, and F is
independently drawn from the symmetrized Dirichlet based on the standard
Cauchy. The posterior $\tilde\pi_n$ given $X_1, \ldots, X_n$ is computed from this prior. For
suitable h: as n → ∞, almost surely, $\tilde\pi_n$ concentrates near ±γ, where γ is a
positive number depending on h. For each large n, there is probability near ½
that $\tilde\pi_n$ concentrates near γ, and probability near ½ that $\tilde\pi_n$ concentrates near
−γ. Moreover, for any η > 0,

\[ \limsup \tilde\pi_n\{\theta: |\theta - \gamma| < \eta\} = 1 \quad \text{a.e.}, \]

\[ \limsup \tilde\pi_n\{\theta: |\theta + \gamma| < \eta\} = 1 \quad \text{a.e.} \]

If desired, h can be chosen strictly positive on the interior of its interval of
support.


The posteriors $\pi_n$ and $\tilde\pi_n$ are computed by using a theorem about the Dirichlet
due to Korwar and Hollander (1973). Originally, we used a discretization argu- 
ment. An abstract version of this is given in Section 4. 

In Diaconis and Freedman (1982, 1983a) we discuss breakdown properties of 
the rules computed here. Related results appear in Huber (1984). 


2. The first construction. The first step is to compute the posterior distribution
of θ and F given the data; a similar result was conjectured by Dalal
(1979a).


LEMMA 2.1. With respect to the prior, let θ and F in the model (1.1) be
independent, θ having density f and F being Dirichlet with parameter measure α
which is absolutely continuous; let $g = \alpha'/\|\alpha\|$, where $\|\alpha\|$ is the mass of α. Let
$A_n$ be the set where $X_i \ne X_j$ for $1 \le i < j \le n$. On $A_n$, the posterior $\pi_n$ can be
characterized as follows:

\[ \pi_n(d\theta) = C_n^{-1} f(\theta) \prod_{i=1}^n g(X_i - \theta)\, d\theta, \]

where

\[ C_n = \int_{-\infty}^{\infty} f(\theta) \prod_{i=1}^n g(X_i - \theta)\, d\theta; \]

\[ \pi_n\{dF \mid \theta\} \text{ is } D\Big(\alpha + \sum_{i=1}^n \delta_{X_i - \theta}\Big). \]

PROOF. Write P for the joint law of θ, F and $X_1, X_2, \ldots$. Thus, $\pi_n$ is the law
of θ and F given $X_1, \ldots, X_n$, computed according to P. On $A_n$, we may compute
$\pi_n$ by first conditioning on $A_n$, then on $X_1, \ldots, X_n$. But $A_n = \{\varepsilon_i \ne \varepsilon_j$ for
$1 \le i < j \le n\}$, so relative to P given $A_n$: The parameter θ still has density f; the
$\varepsilon_i$ are independent with common density g, and are independent of θ, by
Theorem 2.5 of Korwar and Hollander (1973); and $X_i = \theta + \varepsilon_i$. □
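
The conditional posterior of F in Lemma 2.1 can be made concrete. The sketch below is an illustration under stated assumptions, not the authors' computation: it uses the standard fact that a Dirichlet with parameter measure α has mean α/‖α‖, takes α to be a multiple of the standard Cauchy, and uses hypothetical function names. Given θ, the posterior mean of F(t) is then a weighted average of the Cauchy distribution function and the empirical distribution function of the residuals $X_i - \theta$.

```python
import math


def cauchy_cdf(t: float) -> float:
    """Distribution function of the standard Cauchy."""
    return 0.5 + math.atan(t) / math.pi


def posterior_mean_cdf(t: float, data, theta: float, alpha_mass: float = 1.0) -> float:
    """Mean of F(t) under D(alpha + sum of point masses at X_i - theta).

    alpha is assumed to be alpha_mass times the standard Cauchy law.
    """
    residuals = [x - theta for x in data]
    count = sum(1 for r in residuals if r <= t)
    return (alpha_mass * cauchy_cdf(t) + count) / (alpha_mass + len(residuals))


# With no data the posterior mean is just the Cauchy distribution function.
print(posterior_mean_cdf(0.0, [], theta=0.0))
print(posterior_mean_cdf(0.0, [1.0, 2.0, 3.0], theta=2.0))
```

As n grows the empirical term dominates, which is why the choice of θ matters so much: a wrong θ shifts every residual.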


REMARK 1. Lemma 2.1 gives the posterior when the observed values are all
distinct. The argument also gives the posterior in general:
$\pi_n(d\theta) = C_n^{-1} f(\theta) \prod^* g(X_i - \theta)\, d\theta$, where
$C_n = \int_{-\infty}^{\infty} f(\theta) \prod^* g(X_i - \theta)\, d\theta$; the * signifies
that the products are over distinct values only. Finally, $\pi_n(dF \mid \theta)$ remains the
same.


REMARK 2. Under squared error, the Bayes estimate $\hat\theta(X_1, \ldots, X_n)$ of θ is
the posterior mean. For the posterior computed in Lemma 2.1,

\[ \hat\theta(X_1, \ldots, X_n) = C_n^{-1} \int_{-\infty}^{\infty} \theta f(\theta) \prod_{i=1}^n g(X_i - \theta)\, d\theta. \]

This coincides with the Bayes rule for the model (1.1) if $\varepsilon_i$ has known density g;
in the display, however, $g = \alpha'/\|\alpha\|$ is a feature of the prior for F.
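
The Bayes estimate of Remark 2 is a one-dimensional integral and is easy to approximate. The following is a rough sketch only, assuming a standard normal f, a standard Cauchy g, distinct observations, and a plain Riemann sum on a fixed grid; the function names and grid settings are hypothetical choices, not taken from the paper.

```python
import math


def log_posterior(theta: float, data) -> float:
    """Unnormalized log posterior density: log f(theta) + sum of log g(x - theta)."""
    log_prior = -0.5 * theta * theta  # standard normal f, constant dropped
    log_lik = -sum(math.log1p((x - theta) ** 2) for x in data)  # Cauchy g, constant dropped
    return log_prior + log_lik


def posterior_mean(data, lo: float = -10.0, hi: float = 10.0, steps: int = 4001) -> float:
    """Approximate the posterior mean of theta by a Riemann sum on [lo, hi]."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    logs = [log_posterior(t, data) for t in grid]
    top = max(logs)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - top) for v in logs]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


print(posterior_mean([-1.0, 0.0, 1.0]))  # symmetric data: estimate is near 0
print(posterior_mean([5.0]))             # one outlying point: pulled only slightly
```

The normalizing constant $C_n$ never needs to be computed separately, since it cancels in the ratio.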


Here is a sketch of the rest of the argument for Theorem 1; the rigor will 
follow. The computation is very similar to the one in Freedman and Diaconis 
(1982b). Let 


(2.1) $M(x) = \log(1 + x^2)$,

so

\[ g(x) = \frac{1}{\pi} \exp\{-M(x)\}. \]

(2.2) $H(u) = \int M(x - u)\, h(x)\, dx$.

The density h will be constructed so that h(x) dx is essentially mass ½ at each of
±a, where a > 1. Then H(u) has a local maximum at 0, and global minima at
±γ, where $\gamma = \sqrt{a^2 - 1}$. Now by Lemma 2.1, the posterior density of θ is


(2.3) $C_n^{-1} \pi^{-n} f(\theta) \exp\{-S_n(\theta)\}$,

where

(2.4a) $S_n(\theta) = \sum_{i=1}^n M(X_i - \theta) = nH(\theta) + \sqrt{n}\, G_n(\theta)$

and

(2.4b) $G_n(\theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^n [M(X_i - \theta) - H(\theta)]$


is asymptotically a Gaussian process. In particular, the (unnormalized) posterior
mass on a neighborhood of ±γ vanishes at the rate exp{−nH(γ)} while the
posterior mass outside this neighborhood vanishes at a faster rate. Thus, the
normalized posterior mass concentrates near ±γ. The distribution of mass
between the vicinity of γ and the vicinity of −γ is controlled by the relative sizes
of $G_n(\gamma)$ and $G_n(-\gamma)$; these two variables have a nonsingular Gaussian joint
limiting distribution, so posterior mass shifts back and forth between the two
neighborhoods.
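
The back-and-forth in this sketch can be seen in a toy calculation. The code below is an illustration only: it replaces the smooth density h by exact point masses at ±a (the limiting case that h approximates), ignores the prior factor f (which is equal at ±γ for a symmetric prior), and uses hypothetical names. It compares $S_n$ at +γ and −γ for two samples that differ only in which side has a few more observations.

```python
import math


def S(theta: float, data) -> float:
    """S_n(theta) = sum of M(x - theta), with M(x) = log(1 + x^2)."""
    return sum(math.log1p((x - theta) ** 2) for x in data)


def favored_side(data, gamma: float) -> int:
    """Return +1 if exp(-S_n) is larger at +gamma than at -gamma, else -1."""
    return 1 if S(gamma, data) < S(-gamma, data) else -1


a = 2.0
gamma = math.sqrt(a * a - 1.0)

# Two toy samples of size 100 from the two-point stand-in for h.
more_right = [a] * 52 + [-a] * 48
more_left = [a] * 48 + [-a] * 52

print(favored_side(more_right, gamma), favored_side(more_left, gamma))
```

A small imbalance in the sample is enough to decide which false value the posterior favors, and that imbalance changes sign infinitely often as data accumulate.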


LEMMA 2.2. Fix a > 1. Recall M from (2.1). Let

\[ H_a(\theta) = \tfrac12 M(a - \theta) + \tfrac12 M(-a - \theta). \]

Then $H_a(\cdot)$ is symmetric, has a strict local maximum at 0 where $H_a'' < 0$, and
strict global minima at $\pm(a^2 - 1)^{1/2}$ where $H_a'' > 0$.

PROOF. Calculus. □
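
The calculus behind Lemma 2.2 can be spot-checked numerically. The following sketch is illustrative only: a = 2 is an arbitrary choice, the second derivative is approximated by a central difference, and the function names are hypothetical. It also uses the identity $H_a(\gamma) = \log(2a)$ at $\gamma = (a^2 - 1)^{1/2}$, which follows from $(1 + (a - \gamma)^2)(1 + (a + \gamma)^2) = 4a^2$.

```python
import math


def M(x: float) -> float:
    """M(x) = log(1 + x^2), as in (2.1)."""
    return math.log1p(x * x)


def H_a(theta: float, a: float) -> float:
    """H_a(theta) = (1/2) M(a - theta) + (1/2) M(-a - theta)."""
    return 0.5 * M(a - theta) + 0.5 * M(-a - theta)


def second_difference(fn, x: float, step: float = 1e-4) -> float:
    """Central-difference approximation to the second derivative of fn at x."""
    return (fn(x + step) - 2.0 * fn(x) + fn(x - step)) / (step * step)


a = 2.0
gamma = math.sqrt(a * a - 1.0)


def h_fixed(t: float) -> float:
    return H_a(t, a)


print(second_difference(h_fixed, 0.0))    # negative: local maximum at 0
print(second_difference(h_fixed, gamma))  # positive: minimum at gamma
print(h_fixed(gamma), math.log(2.0 * a))  # the two values agree
```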


LEMMA 2.3. There is a compactly supported $C_\infty$ probability density h which
is symmetric about 0 with a strict maximum at 0. In addition,

\[ H(\theta) = \int M(x - \theta)\, h(x)\, dx \]

is symmetric, has a strict local maximum at 0 where H″ < 0, and strict global
minima at ±γ where H″ > 0. Here, γ > 0 depends on h, but is close to
$(a^2 - 1)^{1/2}$. See Figure 1.


PROOF. Choose a sequence $h_\nu$ of densities which are $C_\infty$, symmetric, supported
on [−2a, 2a], with strict maxima at 0, such that $h_\nu(\theta)\, d\theta \to \tfrac12\delta_a + \tfrac12\delta_{-a}$.
Now look at the derivatives

\[ H_\nu^{(j)}(\theta) = (-1)^j \int M^{(j)}(x - \theta)\, h_\nu(x)\, dx
\to (-1)^j\, \tfrac12 [M^{(j)}(a - \theta) + M^{(j)}(-a - \theta)] = H_a^{(j)}(\theta), \]

because $M^{(j)}$ is bounded continuous. Now use Lemma 2.2. □


FIG. 1. Graph of the density h and the function $H(\theta) = \int M(x - \theta)\, h(x)\, dx$.


Let $X_1, X_2, \ldots$ be independent, with the common density h constructed in
Lemma 2.3. The asymptotic behavior of $S_n(\theta)$ in (2.4) will now be determined,
with a view to proving the oscillatory behavior of the posterior; for inconsistency
alone, see Berk (1966). Some results on the Brownian bridge will be helpful. To
state them, let $\mathcal{H}$ be the distribution function of the density h constructed in
Lemma 2.3, and let $\mathcal{H}_n$ be the empirical distribution function of $X_1, X_2, \ldots, X_n$.
Let

\[ B_{n,\mathcal{H}} = \sqrt{n}\,(\mathcal{H}_n - \mathcal{H}). \]

Notice that $B_{n,\mathcal{H}}$ vanishes off [−2a, 2a]. Of course, $B_{n,\mathcal{H}} = B_n \circ \mathcal{H}$, where $B_n$ is the
approximate Brownian bridge on [0, 1] based on the empirical distribution of a
sample of size n from the uniform. The law of the iterated logarithm (Chung,
1949) implies the following result.


LEMMA 2.4. There is a finite constant A, and for almost all ω an $N = N_\omega < \infty$,
such that n > N entails $|B_n(t)| < A(\log\log n)^{1/2}$ for all t.


Clearly, $B_{n,\mathcal{H}}$ converges weakly to the Brownian bridge $B_{\mathcal{H}} = B \circ \mathcal{H}$, where B is
the ordinary Brownian bridge on [0, 1].


LEMMA 2.5. Let φ and ψ be two bounded continuous functions on the line.
Then


(a) $E\{\int \varphi\, dB_{\mathcal{H}}\} = 0$;
(b) $\operatorname{cov}\{\int \varphi\, dB_{\mathcal{H}}, \int \psi\, dB_{\mathcal{H}}\} = \operatorname{cov}[\varphi(X_1), \psi(X_1)]$.


PROOF. This can be reduced to the corresponding result for the ordinary
Brownian bridge B. Since B(t) = W(t) − tW(1) for 0 ≤ t ≤ 1, where W is a
standard Wiener process, the result for B is easily checked. □

LEMMA 2.6. Let X have a symmetric, absolutely continuous distribution.


(a) The distribution of M(X − γ) − M(X + γ) is absolutely continuous;
(b) P{M(X − γ) = M(X + γ)} = 0;
(c) $E\{M(X - \gamma)^k\} = E\{M(X + \gamma)^k\}$ for k = 1, 2, ....


PROOF. (a) Let φ(t) = M(t − γ) − M(t + γ). Then φ is smooth, and two-to-one
except at 0.

(b) Immediate from (a).

(c) Use symmetry. □


The notation in the next lemma may seem perverse, but $Z_+$ is associated with
behavior near γ, and $Z_-$ with behavior near −γ.


LEMMA 2.7. Let $Z_+ = \int M(u - \gamma)\, dB_{\mathcal{H}}(u)$ and $Z_- = \int M(u + \gamma)\, dB_{\mathcal{H}}(u)$. Then
$(Z_+, Z_-)$ has a nonsingular symmetric bivariate Gaussian distribution.

PROOF. By Lemma 2.5(b),

\[ \operatorname{cov}(Z_+, Z_-) = \operatorname{cov}[M(X_1 - \gamma), M(X_1 + \gamma)]. \]

Symmetry follows from Lemma 2.6(c); if the distribution were singular, it would
have to concentrate on the 45° line, contradicting Lemma 2.6(b). The mean is 0
by Lemma 2.5(a). □

Turn now to (2.4). Clearly,

(2.5) $G_n(\theta) = \int M(u - \theta)\, dB_{n,\mathcal{H}}(u)$.

Then

(2.6) $G_n^{(j)}(\theta) = (-1)^j \int M^{(j)}(u - \theta)\, dB_{n,\mathcal{H}}(u) = (-1)^{j+1} \int B_{n,\mathcal{H}}(u)\, M^{(j+1)}(u - \theta)\, du$.


Now Lemma 2.4 can be used.


LEMMA 2.8. Let j be a fixed integer. There is a finite constant $A = A_j$, and
for almost all ω an $N = N_\omega < \infty$, such that n > N entails
$|G_n^{(j)}(\theta)| < A(\log\log n)^{1/2}$ for all θ.


LEMMA 2.9. Under the conditions of Theorem 1, if η > 0, then on $A_n$,

\[ \pi_n\{(\theta, F): |\theta - \gamma| < \eta \text{ or } |\theta + \gamma| < \eta\} \to 1 \quad \text{a.e.} \]


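PROOF. Let $D_n = C_n^{-1} \pi^{-n}$, the normalizing constant. Using (2.3) and Lemma
2.8, a.e. for large n,

\[ \pi_n\{(\theta, F): |\theta - \gamma| < \eta/2 \text{ or } |\theta + \gamma| < \eta/2\}
\ge 2\eta \cdot D_n \cdot \min_{|\theta - \gamma| < \eta/2} f(\theta) \cdot \exp\{-n\rho - A(n \log\log n)^{1/2}\}, \]

where

\[ \rho = \max_{|\theta - \gamma| < \eta/2} H(\theta). \]

On the other hand,

\[ \pi_n\{(\theta, F): |\theta - \gamma| \ge \eta \text{ and } |\theta + \gamma| \ge \eta\}
\le D_n \cdot \int f(\theta)\, d\theta \cdot \exp\{-n\rho^* + A(n \log\log n)^{1/2}\}, \]

where $\rho^* = \min_C H(\theta)$, with C = {|θ − γ| ≥ η and |θ + γ| ≥ η}. Clearly, ρ* > ρ. □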

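LEMMA 2.10. Let

\[ Z_{+n} = \int M(u - \gamma)\, dB_{n,\mathcal{H}}(u), \qquad Z_{-n} = \int M(u + \gamma)\, dB_{n,\mathcal{H}}(u), \]

\[ Y_{+n} = \int M'(u - \gamma)\, dB_{n,\mathcal{H}}(u), \qquad Y_{-n} = \int M'(u + \gamma)\, dB_{n,\mathcal{H}}(u). \]

Fix δ positive but small. Choose η positive but so small that for |θ − γ| < η,

\[ H''(\gamma) - \delta < H''(\theta) < H''(\gamma) + \delta. \]

Put $t = \sqrt{n}(\theta - \gamma)$. Almost surely, for all sufficiently large n, for all θ with
|θ − γ| < η, upper and lower bounds on $S_n(\theta)$ are, respectively,

\[ nH(\gamma) + \sqrt{n}\, Z_{+n} - Y_{+n} t + \tfrac12[H''(\gamma) + 2\delta]\, t^2 \]

and

\[ nH(\gamma) + \sqrt{n}\, Z_{+n} - Y_{+n} t + \tfrac12[H''(\gamma) - 2\delta]\, t^2. \]

Likewise, putting $t = \sqrt{n}(\theta + \gamma)$, almost surely, for all sufficiently large n, for
all θ with |θ + γ| < η, upper and lower bounds on $S_n(\theta)$ are, respectively,

\[ nH(\gamma) + \sqrt{n}\, Z_{-n} - Y_{-n} t + \tfrac12[H''(\gamma) + 2\delta]\, t^2 \]

and

\[ nH(\gamma) + \sqrt{n}\, Z_{-n} - Y_{-n} t + \tfrac12[H''(\gamma) - 2\delta]\, t^2. \]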
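

PROOF. Only the result at γ needs to be proved, for the situation at −γ is
symmetric. Recall (2.4a). To estimate H(θ), use Taylor’s theorem. Of course,
H′(γ) = 0 because H has a minimum at γ; and H″(γ) > 0: see Lemma 2.3.
Expanding around γ,

\[ H(\gamma) + \tfrac12[H''(\gamma) - \delta](\theta - \gamma)^2 < H(\theta) < H(\gamma) + \tfrac12[H''(\gamma) + \delta](\theta - \gamma)^2. \]

Multiplying by n,

\[ nH(\gamma) + \tfrac12[H''(\gamma) - \delta]\, t^2 < nH(\theta) < nH(\gamma) + \tfrac12[H''(\gamma) + \delta]\, t^2. \]

Now $G_n(\theta)$ must be estimated. Expanding M′ around u − γ,

\[ M'(u - \theta) = M'(u - \gamma) + M''(u - \gamma)(\gamma - \theta) + \tfrac12 M'''(\xi_{u\theta})(\gamma - \theta)^2, \]

where $\xi_{u\theta}$ is between u − θ and u − γ. Integrate both sides with respect to
$dB_{n,\mathcal{H}}(u)$, and then integrate by parts as in (2.6):

\[ G_n(\theta) = Z_{+n} + (\gamma - \theta) Y_{+n} + \xi_n(\theta), \]

\[ \sqrt{n}\, G_n(\theta) = \sqrt{n}\, Z_{+n} - Y_{+n} t + \sqrt{n}\, \xi_n(\theta), \]

where

\[ \xi_n(\theta) = -\tfrac12 (\gamma - \theta)^2 \int B_{n,\mathcal{H}}(u)\, M'''(\xi_{u\theta})\, du. \]

In view of Lemma 2.4, almost surely, for all sufficiently large n,

\[ |\xi_n(\theta)| < A(\log\log n)^{1/2} (\gamma - \theta)^2 \quad \text{for all } \theta. \]

Then

\[ \sqrt{n}\, |\xi_n(\theta)| < A(\log\log n / n)^{1/2}\, t^2 < \tfrac12 \delta t^2 \]

for n large. □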


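LEMMA 2.11. Let $\sigma^2 = 1/[H''(\gamma) - 2\delta]$. Almost surely, for all sufficiently
large n, the posterior θ-mass in [γ − η, γ + η] is bounded above by

\[ C_n^{-1} \pi^{-n} \exp\{-nH(\gamma) - \sqrt{n}\, Z_{+n}\} \cdot \exp\{\tfrac12 \sigma^2 Y_{+n}^2\} \cdot \max_{[\gamma - \eta,\, \gamma + \eta]} f \cdot \sigma\sqrt{2\pi} \cdot \frac{1}{\sqrt{n}}. \]

Likewise near −γ, replacing $Z_{+n}$ and $Y_{+n}$ by $Z_{-n}$ and $Y_{-n}$.

PROOF. In view of (2.3) and Lemma 2.10, the posterior density for θ is
bounded above by

(2.7) $C_n^{-1} \pi^{-n} \cdot \max_{[\gamma - \eta,\, \gamma + \eta]} f \cdot \exp[-nH(\gamma) - \sqrt{n}\, Z_{+n}] \cdot \exp[\tfrac12 \sigma^2 Y_{+n}^2] \cdot \sigma\sqrt{2\pi} \cdot \frac{1}{\sigma\sqrt{2\pi}} \exp\{-(t - \mu_n)^2 / 2\sigma^2\}$,

where $\mu_n = \sigma^2 Y_{+n}$. Integrate (2.7) over γ − η < θ < γ + η with respect to dθ, by
changing variables, to get

\[ \frac{1}{\sqrt{n}} \cdot \frac{1}{\sigma\sqrt{2\pi}} \int_{|t| < \eta\sqrt{n}} \exp\{-(t - \mu_n)^2 / 2\sigma^2\}\, dt \le \frac{1}{\sqrt{n}}. \qquad \square \]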


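LEMMA 2.12. Let $\tau^2 = 1/[H''(\gamma) + 2\delta]$. Let Ψ be the standard normal
distribution on the line. Almost surely, for all sufficiently large n, the posterior
θ-mass in [γ − η, γ + η] is bounded below by

\[ C_n^{-1} \pi^{-n} \exp[-nH(\gamma) - \sqrt{n}\, Z_{+n}] \cdot \exp\{\tfrac12 \tau^2 Y_{+n}^2\} \cdot \min_{[\gamma - \eta,\, \gamma + \eta]} f \cdot \tau\sqrt{2\pi} \cdot \frac{1}{\sqrt{n}} \cdot \Psi[\tau Y_{+n} - \eta\sqrt{n}/\tau,\ \tau Y_{+n} + \eta\sqrt{n}/\tau]. \]

Likewise near −γ, replacing $Z_{+n}$ and $Y_{+n}$ by $Z_{-n}$ and $Y_{-n}$.

PROOF. As in Lemma 2.11. □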


PROOF OF THEOREM 1. Almost surely, the posterior concentrates near ±γ:
see Lemma 2.9. For each large n, it will now be argued, there is chance near ½
that the posterior distribution of θ concentrates near +γ. This follows from
Lemmas 2.11 and 2.12. Indeed, the 4-tuple

\[ Z_{+n},\ Z_{-n},\ Y_{+n},\ Y_{-n} \]

has a limiting distribution, where the first two coordinates are symmetric and
jointly absolutely continuous: see Lemma 2.7. Thus, for K large but fixed and n
large,

\[ P\{|Y_{\pm n}| < K\} > 1 - \delta, \]

\[ P\{Z_{-n} - Z_{+n} > 1/K\} > \tfrac12 - \delta. \]

If both events occur, the posterior mass near γ overwhelms that near −γ, in the
ratio

\[ \mathrm{const} \cdot \exp\{\sqrt{n}(Z_{-n} - Z_{+n})\} > \mathrm{const} \cdot e^{\sqrt{n}/K}. \]

Likewise, −γ wins with probability near ½.

The final assertions about the a.e. behavior of the limsup follow from the
Hewitt-Savage 0-1 law, because

\[ P\{\limsup A_n\} \ge \limsup P\{A_n\} \ge \tfrac12, \]

where

\[ A_n = \{\pi_n[|\theta - \gamma| < \eta] > 1 - \delta\} \]

is a symmetric function of $X_1, \ldots, X_n$. These are i.i.d., so P{lim sup $A_n$} = 1. □


PROOF OF THEOREM 2. If F ~ D(α), then $\delta_\theta * F \sim D(\beta)$ where $\beta = \delta_\theta * \alpha$,
and α is the Cauchy. Let G ~ D(β) and

\[ W_i = G\Big(\frac{1}{i+1}, \frac{1}{i}\Big] \Big/ G\Big(0, \frac{1}{i}\Big]. \]

The $W_i$ are independent beta variables, with parameters $p_i$ and $q_i$, where

\[ p_i = \beta\Big(\frac{1}{i+1}, \frac{1}{i}\Big] \approx \beta'(0)\, \frac{1}{i^2}, \qquad
q_i = \beta\Big(0, \frac{1}{i+1}\Big] \approx \beta'(0)\, \frac{1}{i}. \]

Next, $V_i = p_i(-\log W_i)$ is asymptotically exponential with parameter 1; especially,
$c_i e^{-x} < P\{V_i > x\} < c_i [1 - e^{-x/p_i}]^{-1} e^{-x}$, where
$c_i = \Gamma(p_i + q_i)/[p_i \Gamma(p_i) \Gamma(q_i)] \to 1$ because Γ(δ) ≈ 1/δ. Then

\[ \frac{1}{n} \sum_{i=1}^n V_i \to 1 \]

for D(β)-almost all G. Thus

(2.8) $\frac{1}{n} \sum_{i=1}^n \frac{1}{i^2} (-\log W_i) \to \frac{1}{\beta'(0)}$ a.e.

When α is Cauchy, $\beta = \delta_\theta * \alpha$,

\[ \beta'(0) = \frac{1}{\pi} \cdot \frac{1}{1 + \theta^2}, \]

so

\[ \theta^2 = \frac{1}{\pi \beta'(0)} - 1 \]

can be recovered a.e. from $G = \delta_\theta * F$. Likewise, $(1 - \theta)^2$ can be recovered, and
then θ. The exceptional null set in (2.8) depends on β, i.e., on θ. By Fubini’s
theorem, (2.8) will hold for a.a. θ and F. □
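
The last step of this proof, recovering θ from the two quantities θ² and (1 − θ)², is elementary algebra: θ² − (1 − θ)² = 2θ − 1. The sketch below illustrates only that step, under the assumption that the density values β′(0) and β′(1) of the shifted standard Cauchy are already known exactly; in the proof they are limits obtained from G. The function names are hypothetical.

```python
import math


def shifted_cauchy_density(x: float, theta: float) -> float:
    """Density at x of the standard Cauchy shifted by theta."""
    return 1.0 / (math.pi * (1.0 + (x - theta) ** 2))


def recover_theta(density_at_0: float, density_at_1: float) -> float:
    """Recover theta from the shifted Cauchy density at the points 0 and 1."""
    theta_sq = 1.0 / (math.pi * density_at_0) - 1.0            # theta^2
    one_minus_theta_sq = 1.0 / (math.pi * density_at_1) - 1.0  # (1 - theta)^2
    # theta^2 - (1 - theta)^2 = 2 * theta - 1
    return 0.5 * (theta_sq - one_minus_theta_sq + 1.0)


for theta in (-1.3, 0.0, 0.4, 2.5):
    b0 = shifted_cauchy_density(0.0, theta)
    b1 = shifted_cauchy_density(1.0, theta)
    print(theta, recover_theta(b0, b1))
```

Knowing θ² alone leaves the sign of θ undetermined, which is why the second point is needed.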


3. The second construction. We begin by computing the posterior distribution
$\tilde\pi_n$ of θ given the data. The posterior distribution of F given θ and the
data will not be needed. A result of the following form was conjectured by Dalal
(1979a) who checked special cases for samples of size 2 and 3. 


LEMMA 3.1. With respect to the prior, let $\theta$ and $F$ in the model (1.1) be
independent, $\theta$ having a density $f$ and $F$ having the prior distribution $\bar D_\alpha$, a
symmetrized Dirichlet with parameter measure $\alpha$. Suppose $\alpha$ is symmetric and
absolutely continuous; let $g = \alpha'/\|\alpha\|$, where $\|\alpha\|$ is the mass of $\alpha$. The data are
$X_1, \ldots, X_n$. Let $A_n$ be the set where $X_i \ne X_j$ for $i \ne j$ and all $\theta_{ij}$ are distinct,
with

$\theta_{ij} = \tfrac12(X_i + X_j).$

On $A_n$, the posterior $\pi_n$ can be written as

$\pi_n = (\pi_{an} + \pi_{dn})/C_n.$

Again, the constant $C_n$ depends only on the data $X_1, \ldots, X_n$, and normalizes $\pi_n$
to have mass 1. The measure $\pi_{an}$ is absolutely continuous, with density

$\|\alpha\|\, f(\theta) \prod_{i=1}^{n} g(X_i - \theta).$

The measure $\pi_{dn}$ is discrete, with atoms at $\theta_{ij}$ for $i \ne j$; the mass at $\theta_{ij}$ is

$\tfrac12\,[f(\theta_{ij})/g(\delta_{ij})] \prod_{k=1}^{n} g(X_k - \theta_{ij}),$

where

$\delta_{ij} = \tfrac12(X_i - X_j).$


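PROOF. It is convenient to represent the joint distribution $P$ of
$\theta, F, X_1, X_2, \ldots$ as follows. Let $\beta$ be the result of folding $\alpha$ onto $(0, \infty)$: Formally,
$\beta$ is a finite measure on $(0, \infty)$, and

$\beta(0, x) = \alpha(-x, x).$

Let $\theta \sim f$ and $G \sim D(\beta)$, independent. Given $\theta$ and $G$, let $\delta_1, \delta_2, \ldots$ be independent
with common distribution $G$; let $\zeta_1, \zeta_2, \ldots$ be independent, each being $\pm 1$
with probability 1/2. We present $F$, $\varepsilon_i$, and $X_i$ as

$F = \tfrac12(G + G^{-}), \quad \varepsilon_i = \zeta_i\delta_i, \quad X_i = \theta + \varepsilon_i,$

where $G^{-}$ denotes the reflection of $G$ about 0.

We claim
(3.1)    on $A_n$, no three $\delta$'s can be equal.
Indeed, suppose by way of contradiction that, e.g., $\delta_1 = \delta_2 = \delta_3$. Two of the signs
$\zeta_1, \zeta_2, \zeta_3$ must agree, so two of $X_1, X_2, X_3$ coincide. Likewise,
(3.2)    on $A_n$, at most one pair of $\delta$'s can be equal.
Indeed, suppose by way of contradiction that, e.g., $\delta_1 = \delta_2$ and $\delta_3 = \delta_4$. Then
$\zeta_1 = -\zeta_2$, else $X_1 = X_2$; and $\zeta_3 = -\zeta_4$. So $\theta_{12} = \theta_{34}$, a contradiction.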

Let $A_0$ be the event that $\delta_1, \ldots, \delta_n$ are all distinct. For $i \ne j$ among $1, \ldots, n$,
let $A_{ij}$ be the event that $\delta_i = \delta_j$ but all the other $\delta_k$ are distinct from each other
and $\delta_i$. Abbreviate $a = \|\alpha\| = \|\beta\|$. Then

(3.3)    $P(A_0) = a^{n-1}/[(a + 1) \cdots (a + n - 1)],$

(3.4)    $P(A_{ij}) = a^{n-2}/[(a + 1) \cdots (a + n - 1)].$

Formula (3.3) is immediate from Lemma 2.1 of Korwar and Hollander (1973);
formula (3.4) is a small variation on that lemma, and can be derived from it by a
symmetry argument. Or both formulas can be derived by discretization.
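
As a cross-check on (3.3) and (3.4), the following sketch compares them with the standard Polya-urn (Ewens) probability of a given tie pattern among $n$ draws from a Dirichlet process with non-atomic parameter measure of mass $a$. This is not the Korwar-Hollander derivation cited above; the function names are illustrative, and exact rational arithmetic is used so the comparison involves no rounding.

```python
from fractions import Fraction
from math import factorial


def rising(a, n):
    """Rising factorial a(a+1)...(a+n-1)."""
    out = Fraction(1)
    for k in range(n):
        out *= a + k
    return out


def partition_prob(a, block_sizes):
    """Probability that n draws tie according to one specific partition
    with the given block sizes (Polya-urn / Ewens formula, assumed here):
    a^(#blocks) * prod (size-1)! / [a(a+1)...(a+n-1)]."""
    a = Fraction(a)
    n = sum(block_sizes)
    num = a ** len(block_sizes)
    for s in block_sizes:
        num *= factorial(s - 1)
    return num / rising(a, n)


def p_all_distinct(a, n):
    """Right side of (3.3): a^(n-1) / [(a+1)...(a+n-1)]."""
    a = Fraction(a)
    return a ** (n - 1) / (rising(a, n) / a)


def p_one_tie(a, n):
    """Right side of (3.4): a^(n-2) / [(a+1)...(a+n-1)]."""
    a = Fraction(a)
    return a ** (n - 2) / (rising(a, n) / a)
```

The ratio of the two probabilities is $a = \|\alpha\|$, which is the factor that appears in front of the absolutely continuous part of the posterior in Lemma 3.1.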


As (3.1)-(3.2) show, $A_n \subset A_0 \cup \bigcup_{1 \le i < j \le n} A_{ij}$. Let $X = (X_1, \ldots, X_n)$. Bayes'
rule shows that on $A_n$,

(3.5)    $P\{d\theta \mid X\} = P\{A_0 \mid X\}P_0\{d\theta \mid X\} + \sum_{1 \le i < j \le n} P\{A_{ij} \mid X\}P_{ij}\{d\theta \mid X\},$

where $P_0$ is the law of $\theta$ and $X = (X_1, \ldots, X_n)$ given $A_0$; likewise, $P_{ij}$ is the law
of $\theta$ and $X$ given $A_{ij}$. More specifically, with $g = \alpha'/\|\alpha\|$:

(3.6)    Relative to $P_0$: $\theta$ has density $f$;
         $X_k - \theta$ has density $g$; all are independent.

(3.7)    Relative to $P_{ij}$: $\theta$ has density $f$; $X_k - \theta$ has density $g$;
         all are independent, except $(X_i - \theta) = -(X_j - \theta)$, i.e., $\theta = \theta_{ij}$.

This is because, given $A_{ij}$, the $\delta_k$'s for $k \ne j$ are independent with common
distribution $\beta/\|\beta\|$, while $\delta_j = \delta_i$. This follows from Theorem 2.5 in Korwar and
Hollander (1973) by a symmetry argument; or a discretization argument can be
used. On $A_n$, $\zeta_i = -\zeta_j$ and $\varepsilon_i = -\varepsilon_j$. In particular, as will be seen in more detail
below, $P_0$ and $P_{ij}$ restricted to $A_n$ all put absolutely continuous distributions on
$X_1, \ldots, X_n$, with densities $f_0$ and $f_{ij}$ respectively. The 0-term in (3.5) corresponds
to $\pi_{an}$; the $ij$-term to the atom at $\theta_{ij}$ in $\pi_{dn}$.


Note that $P(A_{ij})$ does not depend on $i$ and $j$. Let

(3.8)    $D_n = P(A_0)\, f_0(X_1, \ldots, X_n) + P(A_{ij}) \sum_{1 \le i < j \le n} f_{ij}(X_1, \ldots, X_n).$

In effect, $D_n$ is the probability density of the data, computed a priori. It will
develop that the normalizing $C_n$ in the lemma is $D_n/P(A_{ij}) = \|\alpha\| D_n/P(A_0)$; see
(3.3)-(3.4). By Bayes' rule,

$P\{A_0 \mid X\} = P(A_0)\, f_0(X_1, \ldots, X_n)/D_n,$
$P\{A_{ij} \mid X\} = P(A_{ij})\, f_{ij}(X_1, \ldots, X_n)/D_n.$
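The 0-term in (3.5) is easily dealt with: By (3.6),

$f_0(X_1, \ldots, X_n) = \int_{-\infty}^{\infty} f(\theta) \prod_{k=1}^{n} g(X_k - \theta)\, d\theta$

and

$P_0\{d\theta \mid X\} = \frac{f(\theta) \prod_{k=1}^{n} g(X_k - \theta)\, d\theta}{f_0(X_1, \ldots, X_n)}.$

So

$P\{A_0 \mid X\}P_0\{d\theta \mid X\} = [P(A_0)/D_n]\, f(\theta) \prod_{k=1}^{n} g(X_k - \theta)\, d\theta,$

as required.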
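The $ij$-term in (3.5) is a bit harder. Let $\varepsilon_k = X_k - \theta$ for $k \ne i, j$ and

$\varepsilon = X_i - \theta = -(X_j - \theta).$

Thus, $\theta \sim f$, $\varepsilon_k \sim g$, and $\varepsilon \sim g$; all are independent, and

$X_k = \theta + \varepsilon_k, \quad k \ne i, j,$
$X_i = \theta + \varepsilon, \quad X_j = \theta - \varepsilon.$

Recall $\theta_{ij} = \tfrac12(X_i + X_j)$ and $\delta_{ij} = \tfrac12(X_i - X_j)$. Now $f_{ij}$, the density of $X_1, \ldots, X_n$,
can be computed by the usual calculus, and is

$f_{ij}(X_1, \ldots, X_n) = \tfrac12 f(\theta_{ij})\, g(\delta_{ij}) \prod_{k \ne i, j} g(X_k - \theta_{ij})$

$\qquad = \tfrac12\,[f(\theta_{ij})/g(\delta_{ij})] \prod_{k=1}^{n} g(X_k - \theta_{ij}).$

Of course,

$P_{ij}\{d\theta \mid X\}$ is point mass at $\theta_{ij}$.

This completes the proof, up to routine algebra. □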
REMARK. Given $\theta$ and $X_1, \ldots, X_n$, the law of $F$ is $\bar D_\beta$, where

$\beta = \alpha + \sum_{i=1}^{n} \delta_{X_i - \theta}.$

See Lemma 4.3 of Diaconis and Freedman (1984). This and Lemma 3.1 determine
the joint posterior distribution of $\theta$ and $F$.

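Specialize now to the case where $g(x) = (1/\pi)\,1/(1 + x^2)$. Then

$\pi_{an}(d\theta) = a\,\pi^{-n} f(\theta) \exp[-S_n(\theta)]\, d\theta,$

where $S_n$ was defined in (2.4), and $a = \|\alpha\|$, and

$\pi_{dn}(d\theta) = \tfrac12\,\pi^{-n} \sum_{1 \le i < j \le n} [f(\theta_{ij})/g(\delta_{ij})] \exp[-S_n(\theta_{ij})]\, \delta_{\theta_{ij}}.$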
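
The two pieces of the posterior can be evaluated numerically for a small data set. The sketch below is only an illustration under stated assumptions: it takes $S_n(\theta) = \sum_i \log(1 + (X_i - \theta)^2)$, which is what the Cauchy form of $g$ gives for $\prod_i g(X_i - \theta) = \pi^{-n}\exp[-S_n(\theta)]$; the prior density `f`, the truncation interval `[lo, hi]`, and the trapezoid rule are choices made here, not part of the paper.

```python
import math


def S(theta, xs):
    # Assumed form of S_n: sum of log(1 + (X_i - theta)^2).
    return sum(math.log1p((x - theta) ** 2) for x in xs)


def cauchy(x):
    # Standard Cauchy density g(x) = (1/pi) / (1 + x^2).
    return 1.0 / (math.pi * (1.0 + x * x))


def posterior_pieces(xs, f, a, lo, hi, m=4000):
    """Return (weight of continuous part, dict of atom weights), normalized
    so the total posterior mass is 1. The continuous part is integrated by
    the trapezoid rule on [lo, hi], which truncates the real line."""
    n = len(xs)
    h = (hi - lo) / m
    grid = [lo + i * h for i in range(m + 1)]
    dens = [a * math.pi ** (-n) * f(t) * math.exp(-S(t, xs)) for t in grid]
    mass_ac = h * (sum(dens) - 0.5 * (dens[0] + dens[-1]))
    atoms = {}
    for i in range(n):
        for j in range(i + 1, n):
            th = 0.5 * (xs[i] + xs[j])   # theta_ij
            d = 0.5 * (xs[i] - xs[j])    # delta_ij
            atoms[(i, j)] = (0.5 * math.pi ** (-n) * f(th) / cauchy(d)
                             * math.exp(-S(th, xs)))
    c = mass_ac + sum(atoms.values())    # plays the role of C_n
    return mass_ac / c, {k: v / c for k, v in atoms.items()}
```

For example, with a standard normal prior density for `f` and three observations, the function returns one continuous weight and three atom weights, one for each pair midpoint.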

By construction, the $X_i$'s will be independent with common density $h$, where
$h$ is as in Lemma 2.3, with modes at $-a$, 0, and $a$. If $h$ vanishes except near its
modes, it will be seen that $\pi_{dn}$ is negligible by comparison with $\pi_{an}$, and the
argument for Theorem 1 goes through unchanged. If $h$ is strictly positive on the
interior of its interval of support, however, $\pi_{dn}$ dominates, and must be carefully
estimated.


LEMMA 3.2. Construct $h$ as in Lemma 2.3, but require $h$ to vanish except
near 0 and $\pm a$. In particular, if $X$ and $X'$ are i.i.d. $h$, and $\bar h$ is the density of
$\bar X = \tfrac12(X + X')$, require $\bar h$ to vanish in a neighborhood of $\pm\gamma$. Under these
circumstances, $\|\pi_{dn}\|/\|\pi_{an}\| \to 0$ a.e. as $n \to \infty$.


PROOF. The argument is as in Lemma 2.9. Fix a small open interval around $\gamma$
in which $\bar h$ vanishes. Find $\delta$ small but positive such that $H > H(\gamma) + 3\delta$ off that
interval. Keep $n$ so large (depending on $\omega$, and possible except for a null set) that
$|n^{-1}S_n(\theta) - H(\theta)| < \delta$ for all $\theta$.

Since $f(\theta)\,d\theta$ assigns positive mass to neighborhoods of $\pm\gamma$, and $S_n(\theta) \approx nH(\theta)$,
the mass of $\pi_{an}$ is at least

const. $\pi^{-n}\exp\{-n[H(\gamma) + \delta]\}$.

On the other hand, $\theta_{ij}$ is bounded away from $\pm\gamma$, so the mass of $\pi_{dn}$ is at most

const. $\pi^{-n}\exp\{-n[H(\gamma) + 2\delta]\}$. □
This completes the proof of Theorem 3 in case the density $\bar h$ of $\theta_{ij}$ vanishes
near $\pm\gamma$; indeed, $\pi_{an}$ was estimated in Lemma 2.10.

Turn now to the case where $\bar h(\gamma) = \bar h(-\gamma) > 0$. (For an existence proof, this
case need not be considered, unless, for instance, we want $h$ to be positive on
the interior of its support.) By construction, the $X_i$ are independent with a
common density $h$, which is $C_\infty$ and has compact support, as in Lemma 2.3. The
factors $f(\theta_{ij})/g(\delta_{ij})$ in $\pi_{dn}$ are therefore bounded above and below by positive
constants, and can be ignored for the rest of the argument. Thus, let


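(3.9)    $\hat\pi_n(d\theta) = \pi^{-n} \sum_{1 \le i < j \le n} \exp[-S_n(\theta_{ij})]\, \delta_{\theta_{ij}}$

         $\qquad = \pi^{-n}\binom{n}{2} \exp[-S_n(\theta)]\, q_n(d\theta),$

where $q_n$ is the empirical distribution of the $\binom{n}{2}$ numbers $\theta_{ij}$, $1 \le i < j \le n$,
assigning mass $1/\binom{n}{2}$ to each. This $\hat\pi_n$ is a good-enough approximation to $\pi_{dn}$.
Let $q$ be the theoretical distribution of $\theta_{ij}$, so $q$ has density $\bar h$, and let

(3.10)    $\tilde\pi_n(d\theta) = \pi^{-n}\binom{n}{2} \exp[-S_n(\theta)]\, q(d\theta).$

As will be seen, $\hat\pi_n$ and $\tilde\pi_n$ are close.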
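
For concreteness, the empirical distribution of the pair midpoints can be computed directly. A minimal sketch, with illustrative function names:

```python
from bisect import bisect_right
from itertools import combinations


def pairwise_midpoints(xs):
    """Sorted list of the n(n-1)/2 numbers (X_i + X_j)/2 with i < j."""
    return sorted(0.5 * (xi + xj) for xi, xj in combinations(xs, 2))


def empirical_cdf(points):
    """Distribution function putting mass 1/len(points) on each point;
    `points` must be sorted."""
    m = len(points)

    def q(t):
        return bisect_right(points, t) / m

    return q
```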


LEMMA 3.3 


(a) [Tanl] > 0 a.e. 

(b) #{|8 — y| <9 or |0 + y| < n}/l7,|| 1 ae. 

(c) #,{18 Y< n}Al%|| > 4 ae. and @,{\8 + y| < n}/lz,|| > 2 ae. 

(d) lim sup, .. ,,7,{|@ — y| < m/fl = lim sup, _ .7,{|8 + Yi < a} All| = 1 
a.e. 


Proor. (a) Holds because of the factor g : 
(b), (c), (d) Can be argued as in Section 2, because A(+y) > 0. O 


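The next part of the argument is designed to show that $\hat\pi_n - \tilde\pi_n$ is negligible
by comparison with $\tilde\pi_n$ near $\pm\gamma$. Recall the notation of Lemma 2.10. Fix $\eta > 0$.
Define random variables $R_n^{\pm}$ as follows:

(3.11)    $\tilde\pi_n[\gamma - \eta, \gamma + \eta] = n^{-1/2}\pi^{-n}\binom{n}{2} \exp[-nH(\gamma) - \sqrt{n}Z_{\gamma n}]\, R_n^{+},$

(3.12)    $\tilde\pi_n[-\gamma - \eta, -\gamma + \eta] = n^{-1/2}\pi^{-n}\binom{n}{2} \exp[-nH(\gamma) - \sqrt{n}Z_{-\gamma n}]\, R_n^{-}.$


LEMMA 3.4. The distributions of $R_n^{+}$ and $R_n^{-}$ are tight.


PROOF. This is argued as in Lemmas 2.10 and 2.11, because $\bar h(\pm\gamma) > 0$. □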
Since the $X_i$ have common density $h$, the $H$ in Lemma 2.3 can be written as
follows: $H(\theta) = E\{M(X_i - \theta)\}$. Recall that $h$ is $C_\infty$ and compactly supported;
and $S_n(\theta) = \sum_{i=1}^{n} M(X_i - \theta)$.


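LEMMA 3.5. Almost surely, for each $j$, as $n \to \infty$, $S_n^{(j)}(\theta)/n \to H^{(j)}(\theta)$
uniformly in $\theta$.


PROOF. Starting with (2.4),

$\frac{1}{n}S_n^{(j)}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d^j}{d\theta^j}M(X_i - \theta) = H^{(j)}(\theta) + \frac{1}{\sqrt{n}}G_n^{(j)}(\theta).$

By Lemma 2.8,

$G_n^{(j)}(\theta) = O(\log\log n).$ □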
Recall that $q_n$ is the empirical distribution of the $\theta_{ij}$, and $q$ is the common
theoretical distribution of each $\theta_{ij}$. Recall from Lemmas 2.3 and 2.3 that $\bar h$ is
supported on $[-2a, 2a]$, where $a > 1$. Consider the sequence $Q_n$ of processes
$\{n^{1/2}[q_n(t) - q(t)]: -2a \le t \le 2a\}$. These processes are nearly uniformly
bounded and equicontinuous, in the following sense.


LEMMA 3.6.

(a) For any $\delta > 0$ there is a finite $B = B_\delta$ such that $P\{|Q_n(t)| < B\} > 1 - \delta$
for all $n$.

(b) For any $\delta > 0$ there is a finite $n_0 = n_0(\delta)$ and a positive $\delta^*$ such that for all
$n > n_0$,

$P\{|Q_n(t) - Q_n(s)| < \delta$ for all $s, t$ with $|s - t| < \delta^*\} \ge 1 - \delta.$

PROOF. Let $r_n$ be the empirical distribution of the $X_i + X_j$ and $r$ the
theoretical distribution of each $X_i + X_j$. It is enough to prove the assertions for
$n^{1/2}(r_n - r)$. Let $\rho_n$ and $\rho$ be the empirical and theoretical distributions of $X_i$.
Thus, $r = \rho * \rho$. But $r_n \approx \rho_n * \rho_n$. Indeed, as is easily verified,

$\sqrt{n}\,\|r_n - \rho_n * \rho_n\|_\infty \to 0,$

where $\|\cdot\|_\infty$ is sup norm. Thus it suffices to prove the assertions for

$\sqrt{n}\,[\rho_n * \rho_n - \rho * \rho].$

This last is

$\sqrt{n}\,[\rho_n * \rho_n - \rho * \rho_n] + \sqrt{n}\,[\rho_n * \rho - \rho * \rho].$

The first term at $t$ is

$\int \sqrt{n}\,[\rho_n(t - u) - \rho(t - u)]\, \rho_n(du)$

and the second is

$\int \sqrt{n}\,[\rho_n(t - u) - \rho(t - u)]\, \rho(du).$

But with overwhelming probability, as a function of $n$,

$\sqrt{n}\,[\rho_n(t - u) - \rho(t - u)]$

is uniformly bounded and nearly equicontinuous. □


REMARK. n)/“[q,(t) — q(t)] converges in the sense of the invariance princi- 
ple to a Gaussian process with mean 0 and the same covariance structure as 
HEt — X)], where X ~ # the distribution function of A. 


Recall $\hat\pi_n$ from (3.9) and $\tilde\pi_n$ from (3.10).


LEMMA 3.7. $(\hat\pi_n - \tilde\pi_n)[\gamma - \eta, \gamma + \eta] = o\{\tilde\pi_n[\gamma - \eta, \gamma + \eta]\}$ in probability as
$n \to \infty$, and likewise at $-\gamma$.


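PROOF. Multiply across by $n^{1/2}\pi^{n}/\binom{n}{2}$ to get rid of the extraneous normalizing
constants. The assertion becomes

(3.13)    $\sqrt{n}\int_{\gamma - \eta}^{\gamma + \eta} \exp[-S_n(\theta)]\,(q_n - q)(d\theta) = o\{\exp[-nH(\gamma) - \sqrt{n}Z_{\gamma n}]\}.$

Integrate the left side of (3.13) by parts as $T_1 + T_2$, where

$T_1 = \sqrt{n}\,\exp[-S_n(\theta)]\, g_n(\theta)\,\big|_{\gamma - \eta}^{\gamma + \eta},$

$T_2 = \int_{\gamma - \eta}^{\gamma + \eta} \sqrt{n}\, g_n(\theta) \exp[-S_n(\theta)]\, S_n'(\theta)\, d\theta,$

$g_n(\theta) = [q_n(\theta) - q(\theta)] - [q_n(\gamma) - q(\gamma)].$

Take $T_1$ first: $\sqrt{n}\, g_n(\gamma \pm \eta)$ is small in probability if $\eta$ is small, uniformly in $n$,
by Lemma 3.6; and $\exp[-S_n(\gamma \pm \eta)]$ vanishes at a faster exponential rate than
$\exp\{-nH(\gamma)\}$. Next take $T_2$. Again, $\sqrt{n}\, g_n(\theta)$ is small in probability for all $\theta$ with
$|\theta - \gamma| < \eta$ if $\eta$ is small, uniformly in $n$, by Lemma 3.6. Thus, it suffices to prove

(3.14)    $\int_{\gamma - \eta}^{\gamma + \eta} \exp[-S_n(\theta)]\,|S_n'(\theta)|\, d\theta = O_p\{\exp[-nH(\gamma) - \sqrt{n}Z_{\gamma n}]\}.$

Now a.e. for large $n$, $S_n$ is strictly convex in $[\gamma - \eta, \gamma + \eta]$ and has a unique
minimum on that interval, at say $\gamma_n$; see Lemmas 2.3 and 3.5: $S_n''/n$ is uniformly
close to $H''$ in a neighborhood of $\gamma$, where $H''$ is positive. Thus, $S_n'$ is negative on
$[\gamma - \eta, \gamma_n]$ and positive on $[\gamma_n, \gamma + \eta]$. The contribution from, e.g., the first
interval is

$\int_{\gamma - \eta}^{\gamma_n} \exp[-S_n(\theta)]\,[-S_n'(\theta)]\, d\theta = \exp[-S_n(\gamma_n)] - \exp[-S_n(\gamma - \eta)].$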
FIG. 2. The density with six modes.


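Now $\exp[-S_n(\gamma - \eta)]$ vanishes at a faster exponential rate, and we are left with
showing that

(3.15)    $\exp[-S_n(\gamma_n)] = O_p\{\exp[-nH(\gamma) - \sqrt{n}Z_{\gamma n}]\}.$

Refer to Lemma 2.10: With $t = \sqrt{n}(\theta - \gamma)$ and $c = [H''(\gamma) - 2\delta]$,

$S_n(t) \ge nH(\gamma) + \sqrt{n}Z_{\gamma n} - Y_{\gamma n}t + \tfrac12 ct^2$
$\qquad = nH(\gamma) + \sqrt{n}Z_{\gamma n} - \tfrac12 Y_{\gamma n}^2/c + \tfrac12 c\,(t - c^{-1}Y_{\gamma n})^2.$

Thus

$S_n(\gamma_n) \ge nH(\gamma) + \sqrt{n}Z_{\gamma n} - \tfrac12 Y_{\gamma n}^2/c.$

But $Y_{\gamma n}$ is tight. □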


REMARK. Recall that $h$ has modes at $-a$, 0, and $a$. But if, e.g., the posterior
$\pi_n$ concentrates on $\theta$'s near $-\gamma$, then it also concentrates on $F$'s whose densities
(Figure 2) have six modes, not three, at $-a - \gamma$, $-\gamma$, $\gamma - a$, $a - \gamma$, $\gamma$, $a + \gamma$.


4. An approximation theorem for conditional probabilities. At one
time, we computed the posterior distributions in Theorems 1 and 3 by discretizing
and passing to the limit. Proposition 4.1 gives a rigorous justification for this
procedure, which may be useful in other contexts: e.g., see Ferguson (1973, 1974).
To motivate the result, consider the following computation of a posterior. Let $\Lambda$
be the set of probabilities $\lambda$ on $R$, and $\mu$ a prior probability on $\Lambda$. We wish to
compute the posterior distribution of $\lambda$ given a sample of size $n$ from $\lambda$, by
discretization. Let $k$ be a large positive integer. Let $f_k$ discretize $R$ in the usual
way: $f_k(x)$ is the least $j/k \ge x$. Let $g_k$ lump $\Lambda$ to match: $g_k(\lambda)$ assigns mass
$\lambda((j - 1)/k, j/k]$ to $j/k$.


To define the posterior carefully, let $P_\mu$ be the probability on $\Lambda \times R^n$ for
which

(4.1)    $P_\mu(A \times B) = \int_A P_\lambda(B)\, \mu(d\lambda),$

where $A$ is a Borel subset of $\Lambda$ and $B$ a Borel subset of $R^n$, while $P_\lambda$ is the
power probability $\lambda^n$ on $R^n$. This is the joint distribution of the parameter $\lambda$ and
the data $x \in R^n$. Let $m$ be the marginal probability on $R^n$, namely, $m(B) =
P_\mu(\Lambda \times B)$ for Borel $B \subset R^n$. This is the law of the data. The posterior
distribution is the "Markov kernel" $\tilde\mu(x, d\lambda)$ satisfying

(4.2)    $P_\mu(A \times B) = \int_B \tilde\mu(x, A)\, m(dx)$

for Borel $A \subset \Lambda$ and $B \subset R^n$. A Markov kernel $\tilde\mu(x, A)$ is measurable in $x$ for
each $A$, and a probability in $A$ for each $x$. In sum, $\tilde\mu$ is a regular conditional
distribution for the parameter $\lambda$ given the data $x$. In this case, posterior
distributions are hard to compute, because the family $P_\lambda$ is not dominated by a
$\sigma$-finite measure; there is no likelihood function.

One way to compute the posterior is by discretization. Recall $f_k$ and $g_k$ above.
Let $\phi_k$ discretize $R^n$, by applying $f_k$ separately to each coordinate. Let $\psi_k = g_k$.
The law of $\psi_k$ given $\phi_k$ is easy to compute, because all the probabilities are
discrete; call the result $Q_k(x, d\lambda)$. Proposition 4.1 says that if $Q_k(x, d\lambda)$ converges
to a Markov kernel $Q(x, d\lambda)$ as $k \to \infty$, then $Q(x, d\lambda)$ is the posterior.

Discretization arguments can be applied with more complex parameter spaces
$\Lambda$. To prove Lemma 2.1 by discretization, let $\Lambda$ consist of all pairs $\lambda = (\theta, F)$,
where $\theta$ is real and $F$ is a probability on the line. Let $P_\lambda$ on $R^n$ be $F$ shifted to
the right by $\theta$, raised to the $n$th power. Let $\psi_k(\lambda) = (f_k(\theta), g_k(F))$. Lemma 3.1
can be handled the same way.

For Proposition 4.1, let $(\Omega, \mathcal{F}, P)$ be an arbitrary probability triple. Let $\mathcal{X}$
and $\mathcal{Y}$ be complete separable metric spaces. In the applications above, $\mathcal{X}$ is the
data space and $\mathcal{Y}$ the parameter space; $\Omega = \mathcal{Y} \times \mathcal{X}$ and $P = P_\mu$. Let $X$ and $Y$ be
Borel mappings from $\Omega$ to $\mathcal{X}$ and $\mathcal{Y}$, respectively: In the applications, these are
just the projections. Let $\phi_k$ be a Borel mapping from $\mathcal{X}$ into itself such that
$\phi_k(x) \to x$ pointwise as $k \to \infty$. Let $\psi_k$ be a Borel mapping from $\mathcal{Y}$ into itself
such that $\psi_k(y) \to y$ uniformly as $k \to \infty$. Let $Q_k(x, dy) = R_k(\phi_k(x), dy)$ be a
regular conditional distribution for $\psi_k(Y)$ given $\phi_k(X) = \phi_k(x)$. Suppose $Q(x, dy)$
is a Markov kernel. Suppose $Q_k(x, dy) \to Q(x, dy)$ weak star for each $x$ as
$k \to \infty$. For certain discretizations, Pfanzagl (1979) shows this convergence is
automatic.


PROPOSITION 4.1. Q is a regular conditional distribution for Y given X. 


Proor. Let g be bounded continuous on ¥, while h is bounded and uni- 
formly continuous on Y. Then 


falo, hew,) dP 
(4.3) 


= felo INAY] aP. 
As k > œ, the right side goes to {og(X)h(Y) dP; on the left side, ho y, can be 


replaced by A, with only a small error. Then the left side goes to 
fog( XQ, h) dP. 0 
In the applications, it must be shown that $\psi_k(\lambda) \to \lambda$ as $k \to \infty$ uniformly in
$\lambda$. The $\psi_k$ of interest is the lumping function $g_k$ defined above. The key fact is
(4.4) below. Let $d$ be the Lévy distance between probabilities on $R$: so $d(F, G)$ is
the inf of $\varepsilon > 0$ with $F(x) < G(x + \varepsilon) + \varepsilon$ and $G(x) < F(x + \varepsilon) + \varepsilon$. Then it is
easy to show

(4.4)    $d[F, g_k(F)] < 1/k$ for all $F$.

That is, $g_k(F) \to F$ uniformly in $F$.
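
The bound (4.4) can be checked on a finitely supported example. The sketch below is an illustration only: it represents a probability as a dict from atoms to masses (an assumption made here), uses exact rationals, and verifies the non-strict inequalities at $\varepsilon = 1/k$, which gives $d \le 1/k$ rather than the strict bound stated in the text. Because both distribution functions are step functions, it is enough to test the inequalities at the atoms.

```python
from fractions import Fraction
from math import ceil


def f_k(x, k):
    """Least grid point j/k that is >= x."""
    return Fraction(ceil(x * k), k)


def g_k(F, k):
    """Lump a finitely supported probability F (dict atom -> mass) onto
    the grid: the mass in ((j-1)/k, j/k] is moved to j/k."""
    out = {}
    for x, p in F.items():
        y = f_k(x, k)
        out[y] = out.get(y, 0) + p
    return out


def cdf(F, x):
    """Distribution function of a finitely supported probability."""
    return sum(p for atom, p in F.items() if atom <= x)


def levy_at_most(F, G, eps):
    """True if F(x) <= G(x+eps)+eps and G(x) <= F(x+eps)+eps for all x.
    The left sides only jump at atoms, so checking atoms suffices."""
    ok_fg = all(cdf(F, a) <= cdf(G, a + eps) + eps for a in F)
    ok_gf = all(cdf(G, b) <= cdf(F, b + eps) + eps for b in G)
    return ok_fg and ok_gf
```

Lumping only moves mass to the right, and by less than $1/k$, which is why the check succeeds at $\varepsilon = 1/k$ for every input.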
THEORY OF PARTIAL LIKELIHOOD


BY WING HUNG WONG


University of Chicago 
A general asymptotic theory is developed for the maximum likelihood 
estimator based on a partial likelihood. Conditions are given for consistency 
and asymptotic normality, and a method is provided for the calculation of the 
asymptotic efficiency of the estimator. The implications of the general theory 
are examined in special cases such as inference in stochastic processes, Cox 
regression models, and AR processes with missing segments. 


1. Introduction. Let $y$ be a realization of the random vector $Y$ with density
$f_Y(y; \phi)$ depending on a vector parameter $\phi = (\theta, \eta)$. Consider the situation
where $Y$, perhaps after transformation, can be partitioned into two components,
$Y = (V, W)$. Correspondingly, the full likelihood will then factorize into a marginal
likelihood and a conditional likelihood:

(1.1)    $f_Y(y; \phi) = f_V(v; \phi)\, f_{W|V}(w \mid v; \phi).$


We will suppose we are interested only in inference about $\theta$; $\eta$ will play the role
of a nuisance parameter. In complicated problems, the dimension of $\eta$ may be
high, and the application of maximum likelihood estimation may lead to misleading
results. If in such situations there is a factorization (1.1) such that only one of
the two factors involves $\theta$, then it is often helpful to use just that factor and
disregard the other (which involves $\eta$). Examples and development of marginal
and conditional likelihood methods can be found in Kalbfleisch and Sprott (1970) 
and Andersen (1973). 

It is clear that if one uses only one factor in (1.1) while the ignored factor
involves both $\theta$ and $\eta$, then one has not used the full information (about $\theta$)
contained in the observations. In exchange for the loss of information we achieve
simplicity in analysis resulting from the elimination of nuisance parameters. 
There is also a gain in robustness of validity because the ignored factor in the 
likelihood does not have to be specified precisely. In applications these tradeoffs 
must be weighed carefully. 

In the development of methods based on appropriate factorization of the full 
likelihood, the introduction by Cox (1975) of the concept of partial likelihood is 
an important milestone. Consider the case when Y can be transformed into a 
sequence, 


(1.2)    $y = (w_1, x_1, w_2, x_2, \ldots, w_N, x_N),$

the partition being chosen so that the density of $x_n$ conditional on all the
previous variables depends only on $\theta$. The full likelihood factorizes accordingly

(1.3)    $f(y) = \Big[\prod_{n=1}^{N} f(w_n \mid d_n; \theta, \eta)\Big]\Big[\prod_{n=1}^{N} f(x_n \mid c_n; \theta)\Big],$

where $d_n = (w_1, x_1, \ldots, w_{n-1}, x_{n-1})$, $c_n = (w_1, x_1, \ldots, w_{n-1}, x_{n-1}, w_n)$. Cox called
the second product in the right-hand side of (1.3) the partial likelihood of $\theta$ based
on $X$ in the sequence $(W_n, X_n)_{n = 1, \ldots, N}$.


EXAMPLE. Suppose we observe $J$ disconnected segments of a Markov process,
$[z_{n_1}, \ldots, z_{m_1}], [z_{n_2}, \ldots, z_{m_2}], \ldots, [z_{n_J}, \ldots, z_{m_J}]$, so $z$'s between $z_{m_1}$ and $z_{n_2}$, etc.
are "missing." The joint density for the data is given by (taking $z_{n_1}$ = constant)

(1.4)    $\prod_{j=1}^{J}\Big[\prod_{n=n_j+1}^{m_j} f(z_n \mid z_{n-1})\Big]\ \prod_{j=2}^{J} f(z_{n_j} \mid z_{m_{j-1}}).$

Suppose that within the observed segments the one-step transition probabilities
are given by $P_\theta(x, y)$ with a parameter $\theta$. If we let $w_1 = z_{n_1}$, $x_1 = [z_{n_1+1}, \ldots, z_{m_1}]$,
$w_2 = z_{n_2}$, $x_2 = [z_{n_2+1}, \ldots, z_{m_2}]$, etc., then the partial likelihood based on $X$ is
simply $\prod_{j=1}^{J}\prod_{n=n_j+1}^{m_j} P_\theta(z_{n-1}, z_n)$. In many situations the missing segments may
have a different probabilistic structure from the observed ones. In such cases, the
remaining products in (1.4), namely $\prod_{j=2}^{J} f(z_{n_j} \mid z_{m_{j-1}})$, will typically be difficult to
handle because of nuisance parameters. This example is discussed further in
Section 6.1.
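
To make the example concrete, the following sketch evaluates the partial log-likelihood for observed segments of a two-state chain. The transition rule (switch state with probability `theta`) is a hypothetical parametrization chosen for illustration; only within-segment transitions enter, and the jumps across the missing stretches are ignored, as in the partial likelihood above.

```python
import math


def transition_prob(theta, x, y):
    # Hypothetical P_theta(x, y): switch state with probability theta.
    return theta if x != y else 1.0 - theta


def partial_loglik(theta, segments):
    """Sum of log P_theta(z_{n-1}, z_n) over consecutive pairs inside
    each observed segment; transitions between segments are not used."""
    total = 0.0
    for seg in segments:
        for prev, cur in zip(seg, seg[1:]):
            total += math.log(transition_prob(theta, prev, cur))
    return total


def maximize_on_grid(segments, grid):
    """Crude maximizer of the partial likelihood over a finite grid."""
    return max(grid, key=lambda th: partial_loglik(th, segments))
```

With segments `[[0, 1, 1], [1, 0]]` there are two switches and one stay, so the partial log-likelihood is $2\log\theta + \log(1 - \theta)$ and its maximizer is near 2/3.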


The purpose of the present paper is to develop an asymptotic theory for 
maximum likelihood estimation based on a partial likelihood. The advantage of 
being able to use a partial likelihood is clear: one then has great flexibility in 
obtaining a factorization (1.3) such that the loss of information due to the 
ignored factor is small. 

An equivalent way to define a partial likelihood is as a product $\prod_{n=1}^{N} f_\theta(x_n \mid c_n)$
of the densities of the conditional experiments $x_1|c_1, x_2|c_2, \ldots, x_N|c_N$, where the
$\sigma$-field generated by $c_{n+1}$ contains that generated by $c_n$, $n = 1, 2, \ldots$. This nested
conditioning requirement plays a key role in our development of the asymptotic
theory: it implies that the scores constructed from the conditional densities form
a martingale difference sequence. This means that the information contained in
the different conditional experiments is not redundant. Now, the likelihood for
a stochastic process $x_1, \ldots, x_N$ can always be written as a partial likelihood. Thus
the MLE theory for stochastic processes is a nontrivial special case of the par- 
tial likelihood setting. This area has, of course, been studied extensively; see 
Billingsley (1961), Silvey (1961), Rao (1966), Bhat (1974), Crowder (1976), Caines 
(1975), Basawa, Feigin, and Heyde (1976), Hall and Heyde (1980), and Basawa 
and Rao (1980). Billingsley (1961) had already recognized the importance of the 
martingale differences structure of the conditional scores and proved asymptotic 
normality of the MLE using a martingale central limit theorem developed by 
himself. Our treatment of asymptotic normality in Section 4 for the partial 
likelihood MLE is an extension of the treatment used in the above mentioned 
works. On the other hand, our approach to the consistency problem is quite
different from the one used in the above works. To draw an analogy with the two 
main approaches to consistency in classical MLE theory in the i.i.d. setting, the 
approach used in the above works (except for Caines, who treats only a finite 
parameter space) is the Cramér approach (Cramér, 1946), namely, to exhibit a 
consistent sequence of solutions to the likelihood equation, while the approach 
adopted in the present paper is that of Doob (1934) and Wald (1949). The unified 
treatment for finite and continuum parameter spaces and the avoidance of 
differentiability and uniqueness conditions are two main advantages of the 
Doob-Wald approach. This consistency theory is developed in Section 2. The 
theory is illustrated in Section 3 by several examples. The example of the Cox 
model is included for obvious reasons. The generalized autoregression models are 
studied partly because of their potential usefulness in modelling nonnormal time 
series data. 

Section 5 deals with the theory of efficiency. Our starting point is the classical 
work of Bahadur (1964), Hajek (1970), and others on the Fisher lower bound for 
regular estimators under the locally asymptotically normal (LAN) condition of 
Le Cam (1960). This theory is nicely summarized in the recent monograph of 
Ibragimov and Has'minskii (1981). The main problem we face is how to handle
the nuisance parameter, particularly if it is infinite dimensional. The criterion for 
efficiency adopted in this paper is an extension of the classical one for the finite 
dimensional case. It is also used by some previous authors, e.g., Stein (1956), 
Lindsay (1980), and Begun, Hall, Huang, and Wellner (1983), mainly in the case 
of i.i.d. observations. In Section 5.2 we present a simple argument showing why 
the minimal Fisher information (Lindsay, 1980) provides a lower bound for the 
asymptotic variance of regular estimates in the general case. This argument 
represents a precise formulation of Stein’s argument which predates the rigorous 
theories of Bahadur and Hajek. We also study the method of calculating the 
minimal Fisher information by taking the limit in a sequence of finite parameter- 
izations and provide conditions under which the method works. At present this 
seems to be the only systematic method of calculation in the general case. 
Although it is believed that in most cases the minimal Fisher information is an 
appropriate benchmark for efficiency comparison, this has never been rigorously 
established. We have only been able to provide a partial justification, as discussed 
at the end of Section 5.2. 

The efficiency theory is illustrated in Section 6 by an in-depth study of the 
example of the segmented AR process. Most of the general points made in 
Section 5 find concrete representation in this example. The calculation shows 
that the partial likelihood is not fully informative in the random level shift case, 
even if the shift density is assumed unknown. This result is, at least initially, 
rather unexpected. Another example studied in Section 6 is the Cox regression 
model. 

Consistency and asymptotic normality in the Cox regression have been inten- 
sively studied in recent years: Tsiatis (1981), Liu and Crowley (1978), Andersen 
and Gill (1982), Slud (1982), Bailey (1983), and Prentice and Self (1983). In this 
paper we use the Cox model mainly as an illustration for the general theory. We 
assume that the covariates are nonrandom, time independent, and can take only
a finite number of values. In this respect our treatment is more restricted than
some earlier ones. On the other hand, an arbitrary risk of the form $u(\theta, z)$ is
allowed rather than the usual choice $e^{\theta \cdot z}$ used by all authors except Prentice and
Self (1983), who allow a form $r(\theta \cdot z)$ where $r$ is a general function. Within our
model we give explicit calculations of the asymptotic distributions. It is hoped 
that the ordinary differential equations derived in the appendix are a useful 
addition to the literature. 

Our calculation of efficiency in the Cox model reproduces some of the results of 
Efron (1977) and Oakes (1977), but our method is entirely different from theirs. 
Although implicit in Efron (1977), the rigorous derivation of asymptotic efficiency 
of the partial likelihood in the Cox model has been given only recently by Begun 
et al. (1983) under the i.i.d. covariate assumption; see also Pfanzagl (1982). Our
derivation in this paper is under the assumption of nonrandom covariates. The 
discussion also appears to be the first systematic treatment of efficiency for 
general risk forms. 


2. Consistency. 


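2.1. Finite parameter space. In the classical framework of i.i.d. observations
where $x_1, x_2, \ldots, x_N$ are i.i.d. with common density $f(x; \theta)$; for any fixed $\theta \ne \theta_0$,
if $r_n(\theta) = \log(f_{\theta_0}(x_n)/f_\theta(x_n))$, then the Kullback-Leibler discriminatory information
between $\theta_0$ and $\theta$ contained in $x_n$ is given by

$i_n(\theta) = E_{\theta_0}(r_n(\theta)) = \int f_{\theta_0} \log(f_{\theta_0}/f_\theta)\, dx > 0.$

The variance $j_n(\theta) = \mathrm{Var}_{\theta_0}(r_n(\theta))$ is also independent of $n$. Hence for each $\theta \ne \theta_0$,
we have

(2.1)    $I_N(\theta) = \sum_{n=1}^{N} i_n(\theta) = N i_1(\theta) \to \infty,$

(2.2)    $J_N(\theta) = \sum_{n=1}^{N} j_n(\theta) = N j_1(\theta) = o(I_N^2(\theta)).$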


To see the meaning of these two conditions, denote by $R_N(\theta)$ the logarithm of
the likelihood ratio, i.e., $R_N(\theta) = \log\big(\prod_{n=1}^{N} f_{\theta_0}(x_n)/\prod_{n=1}^{N} f_\theta(x_n)\big) = \sum_{n=1}^{N} r_n(\theta)$; then
the Kullback-Leibler discriminatory information contained in $x_1, \ldots, x_N$ is
$E R_N(\theta) = I_N(\theta)$, and the variance of $R_N(\theta)$ is just $J_N(\theta)$. Clearly, (2.1) and
(2.2) imply the divergence (to infinity) of $R_N(\theta)$, which in turn implies the
consistency of the MLE if the parameter space $\Theta$ is finite.
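
A small simulation illustrates the i.i.d. case. The sketch below uses Bernoulli observations with a three-point parameter space; the model, parameter values, sample size, and seed are all choices made here for illustration. It returns, for each alternative $\theta$, the realized log likelihood ratio $R_N(\theta)$ together with the accumulated information $I_N(\theta) = N i_1(\theta)$.

```python
import math
import random


def kl_bernoulli(p0, p):
    """Kullback-Leibler information i_1 between Bernoulli(p0) and Bernoulli(p)."""
    return p0 * math.log(p0 / p) + (1 - p0) * math.log((1 - p0) / (1 - p))


def log_lr(xs, p0, p):
    """R_N(p): sum over observations of log f_{p0}(x_n) / f_p(x_n)."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return ones * math.log(p0 / p) + zeros * math.log((1 - p0) / (1 - p))


def simulate(theta0=0.5, alternatives=(0.3, 0.7), N=2000, seed=0):
    """Draw N Bernoulli(theta0) observations and return, for each
    alternative p, the pair (R_N(p), I_N(p))."""
    rng = random.Random(seed)
    xs = [1 if rng.random() < theta0 else 0 for _ in range(N)]
    return {p: (log_lr(xs, theta0, p), N * kl_bernoulli(theta0, p))
            for p in alternatives}
```

When every $R_N(\theta)$ with $\theta \ne \theta_0$ is positive, the likelihood is maximized at $\theta_0$ over the finite parameter space, which is the consistency statement in miniature.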

Now the basic structure of the partial likelihood framework is a sequence of
conditional experiments $x_1|c_1, \ldots, x_N|c_N$, where the $\sigma$-field generated by $c_{n+1}$
contains that generated by $c_n$. In each experiment, $c_n$ is regarded as fixed and $x_n$
has a conditional density $f_\theta(x_n|c_n)$; the partial likelihood is nothing but the
product of such conditional likelihoods (defined up to a multiplicative constant
independent of $\theta$), i.e., $PL \propto \prod_{n=1}^{N} f_\theta(x_n|c_n)$.
To extend the above argument, let

$r_n(\theta) = \log(f_{\theta_0}(x_n|c_n)/f_\theta(x_n|c_n)), \quad R_N = \sum_{n=1}^{N} r_n,$

$i_n(\theta) = E(r_n(\theta) \mid c_n), \quad I_N = \sum_{n=1}^{N} i_n,$

$j_n(\theta) = \mathrm{Var}(r_n(\theta) \mid c_n), \quad J_N = \sum_{n=1}^{N} j_n,$

$m_n(\theta) = r_n(\theta) - i_n(\theta), \quad M_N = \sum_{n=1}^{N} m_n(\theta).$

Note that $R_N$ is the logarithm of the partial-likelihood ratio, i.e., $R_N(\theta) =
\log(PL(\theta_0)/PL(\theta))$. Conditional on $c_n$, the discriminatory information (between
$\theta_0$ and $\theta$) contained in $x_n$ is just $i_n(\theta)$; for this reason the sum $I_N = \sum_{n=1}^{N} i_n$ will be
called the accumulated Kullback-Leibler information in the sequence of conditional
experiments $x_1|c_1, x_2|c_2, \ldots, x_N|c_N$. In general, both the conditional information
$i_n(\theta)$ and the conditional variance $j_n(\theta)$ are random variables; they
reduce to constants when $x_n$ is independent of $c_n$.

When the parameter space $\Theta$ is finite, the divergence of $R_N(\theta)$ for every
$\theta \ne \theta_0$ implies the consistency of $\hat\theta$, the value that maximizes the partial
likelihood. The following theorem provides sufficient conditions for the divergence of
$R_N$.


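THEOREM 2A. Suppose $\mathcal{F}_0, \mathcal{F}_1, \ldots$ is a sequence of increasing $\sigma$-fields,
$R_N = \sum_{n=1}^{N} r_n$, and, for $n = 1, 2, \ldots$, $r_n$ is measurable with respect to $\mathcal{F}_n$,
$i_n = E(r_n \mid \mathcal{F}_{n-1})$, $j_n = \mathrm{Var}(r_n \mid \mathcal{F}_{n-1})$. If there exist constants $\delta > 0$, $a_N \uparrow \infty$ such that

(2.3)    $P(I_N/a_N > \delta) \to 1,$

(2.4)    $J_N/a_N^2 \to_p 0,$

then $R_N/I_N \to_p 1$. If only (2.4) holds, then $a_N^{-1}M_N = a_N^{-1}(R_N - I_N) \to_p 0$.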
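Proof. Let $A_N = \{I_N/a_N > \delta\}$; then, for any $\varepsilon > 0$,

$P(|R_N/I_N - 1| > \varepsilon) \le P(|R_N - I_N| > \varepsilon\delta a_N) + P(A_N^c),$

since on $A_N$ the event $|R_N/I_N - 1| > \varepsilon$ forces $|R_N - I_N| > \varepsilon I_N > \varepsilon\delta a_N$.
The second term goes to zero by (2.3). To estimate the first term, write

$R_N - I_N = \sum_{n=1}^{N} m_n,$

and let $m_n^* = m_n\chi(J_n/a_N^2 \le 1)$, where $\chi(\cdot)$ is the indicator function. Since
$P\{m_n^* = m_n\ \forall\ n = 1, \ldots, N\} \ge P\{J_N/a_N^2 \le 1\} \to 1$, it suffices to estimate
$P(|\sum_{n=1}^{N} m_n^*| > \varepsilon\delta a_N)$. Now,

$E(m_n^* \mid \mathcal{F}_{n-1}) = \chi(J_n/a_N^2 \le 1)\,E(m_n \mid \mathcal{F}_{n-1}) = 0,$

$\mathrm{Var}(m_n^* \mid \mathcal{F}_{n-1}) = \chi(J_n/a_N^2 \le 1)\,\mathrm{Var}(m_n \mid \mathcal{F}_{n-1});$

$J_N^* = \sum_{n=1}^{N} \mathrm{Var}(m_n^* \mid \mathcal{F}_{n-1}) \le J_N.$

Hence, $a_N^{-2}J_N^* \le a_N^{-2}J_N \to_p 0$. Thus, using boundedness of $J_N^*/a_N^2$, one obtains

$P\Big\{\Big|\sum_{n=1}^{N} m_n^*\Big| > \varepsilon\delta a_N\Big\} \le \frac{1}{(\varepsilon\delta a_N)^2}\,E\Big(\sum_{n=1}^{N} m_n^*\Big)^2 = \frac{1}{(\varepsilon\delta)^2}\,E(J_N^*/a_N^2) \to 0.$ □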
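
Theorem 2A can be illustrated on a case where $i_n$ and $j_n$ are genuinely random. The sketch below uses a Gaussian first-order autoregression $x_n = \theta_0 x_{n-1} + e_n$ with unit-variance noise; this model, the parameter values, and the seed are assumptions made for illustration. For this model $r_n = d\,x_{n-1}e_n + \tfrac12 d^2x_{n-1}^2$ with $d = \theta_0 - \theta$, so $i_n = \tfrac12 d^2x_{n-1}^2$ and $j_n = d^2x_{n-1}^2$.

```python
import random


def ar1_information(theta0=0.5, theta=0.2, N=5000, seed=1):
    """Simulate x_n = theta0 * x_{n-1} + e_n, e_n ~ N(0, 1), and return
    (R_N, I_N, J_N) for the alternative value theta."""
    rng = random.Random(seed)
    d = theta0 - theta
    x_prev = 0.0
    R = I = J = 0.0
    for _ in range(N):
        e = rng.gauss(0.0, 1.0)
        x = theta0 * x_prev + e
        # r_n = log f_theta0(x | x_prev) - log f_theta(x | x_prev)
        R += 0.5 * ((x - theta * x_prev) ** 2 - (x - theta0 * x_prev) ** 2)
        I += 0.5 * d * d * x_prev ** 2   # i_n = E(r_n | past)
        J += d * d * x_prev ** 2         # j_n = Var(r_n | past)
        x_prev = x
    return R, I, J
```

Taking $a_N = N$, the quantity $I_N/N$ stays bounded away from 0 while $J_N/N^2$ is small, so the simulated ratio $R_N/I_N$ should be close to 1.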








In the above theorem, since $a_N \uparrow \infty$, (2.3) and (2.4) imply

(2.5)    $I_N \to_p \infty$

and

(2.6)    $J_N = o_p(I_N^2).$

The conditions (2.5) and (2.6) seem to be the natural extension of conditions (2.1)
and (2.2); however, until now a proof of the divergence of $R_N$ under (2.5) and
(2.6) has not been obtained. In any case, the meaning of the conditions is clear:
(2.1), (2.3), or (2.5) represent "accumulation of information," (2.2), (2.4), or (2.6)
represent "stability of variance." Note that for $\hat\theta$ to be consistent, the conditions
in Theorem 2A must be satisfied by $I_N(\theta)$, $J_N(\theta)$ for each $\theta \ne \theta_0$ in the finite
parameter space. The constants $a_N$ may depend on $\theta$ as long as $a_N(\theta) \uparrow \infty$ for
each $\theta$.

In some applications, such as the Cox regression model discussed below
in Section 3, it is necessary to formulate the partial likelihood in terms of
triangular arrays, i.e., for each $N$, there are conditional experiments
$x_1^{(N)}|c_1^{(N)}, \ldots, x_{K_N}^{(N)}|c_{K_N}^{(N)}$, but for the same $n$, $x_n^{(N)}$ and $c_n^{(N)}$ need not be
the same as $x_n^{(N')}$ and $c_n^{(N')}$ if $N \ne N'$. For the discussion of weak consistency or
asymptotic distribution, it does not matter whether the array is single or
triangular. For simplicity we will always write $x_n$ and $c_n$, with the understanding
that in the triangular array case $x_n$ and $c_n$ may depend on $N$.

In the single array case, the following result of Neveu (1965, page 148) is 
relevant for strong consistency. 


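surely on the set $\Omega_0 = \{I_N \uparrow \infty,\ \sum_{n=1}^{\infty}(j_n/I_n^2) < \infty\}$.


To apply this to partial likelihood, we must take $r_n = r_n(\theta) =
\log(f_{\theta_0}(x_n|c_n)/f_\theta(x_n|c_n))$; hence $\Omega_0$ in Lemma 2B may depend on $\theta$. If $P(\Omega_0(\theta)) = 1$
for $\theta \ne \theta_0$ in the finite parameter space, then by Lemma 2B we have
$P(\min_{\theta \ne \theta_0} R_N(\theta) \to \infty) = 1$, from which strong consistency of $\hat\theta_N$ follows readily.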
In the classical case of independent observations, both $i_n$, $j_n$ are constants, so
$P(\Omega_0(\theta))$ is either 1 or 0. In the general case of partial likelihood, the whole range
of values in $[0, 1]$ is possible for $P(\Omega_0(\theta))$. Likewise, the set $\Omega_c = \{\hat\theta_N \to \theta_0\}$ may
have probability other than 0 or 1. We call $\Omega_c$ the consistency set, and its
probability $P(\Omega_c)$ the level of consistency. By Lemma 2B, a lower bound for the
level of consistency is $P(\bigcap_{\theta \ne \theta_0} \Omega_0(\theta))$.


2.2. Compact parameter space. In this section, we assume the parameter 
space to be compact. Extension to a general parameter space will be taken up in 
the next section. 


THEOREM 2C. Suppose $\Theta$ is compact, and suppose that for any $\theta \ne \theta_0$, there
exists an open neighborhood $O_\theta$ of $\theta$ whose closure $G_\theta$ does not contain $\theta_0$, and
that there are constants $\delta > 0$, $a_N \uparrow \infty$ (which may depend on $\theta$) such that

(2.7)    $P(\inf_{\theta' \in G_\theta} I_N(\theta')/a_N > \delta) \to 1,$

(2.8)    $J_N(\theta')/a_N^2 \to_p 0$ for all $\theta' \in G_\theta$,

(2.9)    The distribution of $a_N^{-1}M_N(\theta')$ is tight in $C(G_\theta)$, where $M_N = R_N - I_N$
         and $C(G_\theta)$ is the space of continuous functions on $G_\theta$.

Then $\hat\theta_N \to_p \theta_0$.


PROOF. (i) First we show that (2.7)-(2.9) imply that

    $P(\inf_{\theta' \in G_\theta} R_N(\theta') < 0) \to 0$.

To see this, observe that by the argument in the proof of Theorem 2A, (2.7) and
(2.8) together imply that the finite dimensional distributions of $a_N^{-1} M_N(\cdot)$
converge to those of the random function degenerate at 0. Hence under the tightness
condition (2.9), $a_N^{-1} M_N(\cdot) \to 0$ weakly in $C(G_\theta)$. Since $R_N = I_N + M_N$, this
together with (2.7) yields the desired result.

(ii) To prove the theorem, let $O_0$ be any open neighborhood of $\theta_0$ and
consider the compact set $\Theta \setminus O_0$. By compactness $\Theta \setminus O_0$ can be covered by a
finite number of open sets $O_{\theta_1}, \ldots, O_{\theta_k}$, each of which satisfies conditions (2.7)-(2.9).
Hence by part (i) of this proof,

    $P(\inf_{\theta' \in G_{\theta_i}} R_N(\theta') < 0) \to 0$ for $i = 1, \ldots, k$,

whence

(2.10)    $P(\inf_{\theta' \in \Theta \setminus O_0} R_N(\theta') < 0) \le \sum_{i=1}^{k} P(\inf_{\theta' \in G_{\theta_i}} R_N(\theta') < 0) \to 0$

as $N \to \infty$, giving the desired result. □


In typical situations the parameter space $\Theta$ is also endowed with a metric or a
linear structure. If $\Theta$ has a natural metric, it is convenient, and we will always do
so, to take $O_\theta$ in Theorem 2C to be an open ball centered at $\theta$. If, further, $\Theta$ is a
subset of a normed linear space, then the following criterion for tightness is
useful.


LEMMA 2D. If $\Theta$ is a compact subset of a normed linear space, then
condition (2.9) in Theorem 2C can be replaced by

(2.9a)    With probability 1, $M_N(\theta)$ has Frechet derivative $\nabla M_N(\theta)$ such that
          for some constant $K > 0$, $P(\sup_{\theta' \in G_\theta} a_N^{-1} \|\nabla M_N(\theta')\| > K) \to 0$.


PROOF. Since $G_\theta$ is taken to be a closed ball, any intermediate value between
$\theta_1$ and $\theta_2$ also lies in $G_\theta$ if $\theta_1$ and $\theta_2$ are in $G_\theta$. Hence under our assumptions
tightness follows readily from the intermediate value theorem and Theorem 8.2
of Billingsley (1968). □


2.3. General parameter space. To cover more general parameter spaces, a
typical approach, introduced first in Wald (1949), is to consider conditions that
guarantee that $\hat\theta$ will eventually be confined to a compact subset of $\Theta$. These
types of conditions can be called conditions of “essentially compact parameter
space.” In this section we consider the condition

(2.11)    There exists a compact subset $K$ of $\Theta$ such that $\theta_0 \in$ interior of $K$,
          and $P(\inf_{\theta \notin K} R_N(\theta) < 0) \to 0$.

It is clear that under this condition, if the local conditions (2.7)-(2.9) of Theorem
2C are satisfied for every $\theta \ne \theta_0$ in $K$, then we still have $\hat\theta_N \to_P \theta_0$. The proof is a
straightforward extension of that of Theorem 2C.

The following theorem concerns a special case, covering a variety of applications,
in which essential compactness is automatically satisfied.


THEOREM 2E. Let $\Theta$ be a convex set in $R^p$, $\theta_0 \in$ interior of $\Theta$, and $L_N(\theta)$
the logarithm of the partial likelihood. If the local conditions (2.7)-(2.9) are
satisfied for all $\theta \ne \theta_0$, and

(2.12)    $P(L_N(\theta)$ is strictly concave in $\theta) = 1$ for all $N$,

then (2.11) is also true, and hence $\hat\theta_N \to_P \theta_0$.

PROOF. Let $O_1$ and $O_2$ be open balls centered at $\theta_0$ with radius $\rho_1$ and $\rho_2$,
respectively; $\rho_1$ and $\rho_2$ are chosen such that $\rho_1 < \rho_2$ and $O_2 \subset$ interior of $\Theta$. Let
$G_2$ denote the closure of $O_2$ and $\tilde\theta_N$ denote the $\theta$ in $G_2$ that maximizes $L_N(\theta)$.
Since $R_N(\theta) = L_N(\theta_0) - L_N(\theta)$, we have

    $\{\inf_{\theta \notin G_2} R_N(\theta) < 0\} \subset \{L_N(\theta^*) > L_N(\theta_0)$ for some $\theta^* \notin G_2\}$

by concavity

    $\subset \{L_N(\theta^{**}) \ge L_N(\theta_0)$ for some $\theta^{**} \in G_2 \setminus O_1\}$.

This last set is seen to have probability tending to zero by applying the result
(2.10) with $G_2$ as the parameter space. □


Two remarks about the conditions: (i) the condition (2.12) can obviously be
relaxed to require only that $P(L_N$ strictly concave$) \to 1$; (ii) it is clear that the
same proof will go through if $\Theta$ is a convex set in a topological vector space and
$\theta_0$ belongs to the interior of a compact subset $G_2$ in $\Theta$. However, such a
generalization is only superficial: if every point in a topological vector space is
required to have an open neighborhood with compact closure, then the topological
vector space must be finite dimensional.

An important class of models where the concavity condition (2.12) holds is the
class of natural exponential families

(2.13)    $f_\theta(x_n|c_n) = h_n(x_n) e^{x_n \cdot \theta - b_n(\theta)}$ w.r.t. a measure $\nu_n$.

The functions $h_n$, $b_n$, the measure $\nu_n$, and even the range of $x_n$ may all depend on
$c_n$. But for any given $c_n$,

    $b_n(\theta) = \log \int h_n(x) e^{x \cdot \theta} \, d\nu_n(x)$

is clearly a strictly convex function of $\theta$ if $h_n(x)\,d\nu_n(x)$ is not a degenerate
distribution. Hence $L_N(\theta) = (\sum_1^N x_n) \cdot \theta - \sum_1^N b_n(\theta)$ (up to a term not
depending on $\theta$) is strictly concave if $h_n(x)\,d\nu_n(x)$ is not degenerate for at
least one $n \le N$.
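
The convexity claim can be checked numerically in a simple case. The following is a minimal sketch, assuming a scalar $\theta$ and a discrete base measure; the support points, the weights, and all function names are hypothetical and chosen only for illustration. It compares a finite-difference second derivative of $b(\theta)$ with the variance of the exponentially tilted distribution, and shows that the curvature vanishes when the base measure is degenerate.

```python
import math

def cumulant(theta, support, weights):
    """b(theta) = log sum_x h(x) exp(x * theta) for a discrete base measure."""
    return math.log(sum(w * math.exp(x * theta) for x, w in zip(support, weights)))

def tilted_variance(theta, support, weights):
    """Variance of x under the tilted law h(x) exp(x*theta - b(theta))."""
    z = [w * math.exp(x * theta) for x, w in zip(support, weights)]
    total = sum(z)
    mean = sum(x * zi for x, zi in zip(support, z)) / total
    return sum((x - mean) ** 2 * zi for x, zi in zip(support, z)) / total

def second_difference(f, theta, eps=1e-3):
    """Central finite-difference approximation to f''(theta)."""
    return (f(theta + eps) - 2.0 * f(theta) + f(theta - eps)) / eps ** 2

# Hypothetical nondegenerate base measure (an assumption, not from the paper).
support = [0.0, 1.0, 3.0]
weights = [0.5, 0.3, 0.2]
curvature = second_difference(lambda t: cumulant(t, support, weights), 0.4)
variance = tilted_variance(0.4, support, weights)

# Degenerate base measure: b is linear in theta, so the curvature is zero.
flat = second_difference(lambda t: cumulant(t, [2.0], [1.0]), 0.4)
```

In the nondegenerate case the curvature agrees with the tilted variance and is strictly positive, which is the mechanism behind the strict concavity of $L_N$.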

To apply Theorem 2E, one must also check the local conditions (2.7)-(2.9). As
will be seen shortly, the verification of (2.7)-(2.8) for nonstationary cases can
involve considerable work in each specific model. We now argue that for the
natural exponential family model, the tightness condition (2.9) is automatically
satisfied whenever condition (2.8) is satisfied. To see this, first use properties
of exponential families to check that $E((\partial/\partial\theta_i) m_n(\theta)|c_n) = 0$ and
$\mathrm{Var}((\partial/\partial\theta_i) m_n(\theta)|c_n) = \mathrm{Var}(x_{ni}|c_n)$, where $x_{ni}$ is the $i$th component of $x_n$. If
(2.8) is true then certainly $a_N^{-2} \sum_1^N \mathrm{Var}(x_{ni}|c_n) \to_P 0$, which is sufficient for
$a_N^{-1} (\partial/\partial\theta_i) M_N(\theta) = a_N^{-1} \sum_1^N (\partial/\partial\theta_i) m_n(\theta) \to_P 0$. Thus condition (2.9a), and
hence condition (2.9), are satisfied.


3. Examples. 


3.1. Generalized autoregression. The normal theory linear model with
dependent variable $x$ and regressors $z^1, \ldots, z^p$ can be written as $x_1, \ldots, x_N \sim$
independent normals, $E x_n = \gamma_n$, $\mathrm{Var}\, x_n = \sigma^2$, where $\gamma_n = \sum_{i=1}^{p} \theta_i z_n^i$. To handle
time series data, let the regressors be lagged variables. Then we have the normal
autoregressive model: given $x_1, \ldots, x_{n-1}$, $x_n$ is normal with mean $\gamma_n = \sum_{i=1}^{p} \theta_i x_{n-i}$
and variance $\sigma^2$.

The distributional assumption of the linear model can be relaxed: normality
may be replaced by any location scale family (with second moments) without
affecting the asymptotic distribution of the least-squares estimates for $\theta$. The
constant variance assumption is, however, quite crucial; for this reason the linear
model is not appropriate for most discrete data. For example, for binary data $x$ it
is found useful to consider models in which the probit or logit of $x_n$ depends
linearly on the regressor $z_n$. By allowing a suitable parameter of the distribution
of $x_n$ to depend linearly on $z_n$, the scope of the model can be extended to cover
diverse types of data. Such “generalized linear models” are typically applied to
situations where $x_1, \ldots, x_N$ are independent, and parameters are typically
estimated by maximum likelihood [see McCullagh and Nelder (1983) for
developments of these models]. It seems natural that, to handle time series data, we can
choose the regressors to be lagged values of $x_n$. The resulting model will be called
a “generalized autoregressive model.” It bears the same relation to the
generalized linear model as that of the (normal) autoregressive model to the (normal)
linear model.

If we restrict attention to cases where the conditional distribution of $x_n$ given
$c_n = (x_1, \ldots, x_{n-1})$ belongs to an exponential family, the generalized autoregressive
process can be written in the form

    $f_\theta(x_n|c_n) = h_n(x_n) \exp\{x_n \gamma_n(\theta) - b_n(\theta)\}$,

where

    $\gamma_n(\theta) = \gamma(\eta_n(\theta))$,    $b_n(\theta) = b(\gamma_n(\theta))$,

and

(3.1)    $\eta_n(\theta) = \theta_0 + \sum_{i=1}^{p} \theta_i x_{n-i}$;

$\gamma$, $b$ are known functions. Some special cases are listed in Table 3.1.

The conditions for ergodicity in Table 3.1 will be derived in Appendix A.1.
Under ergodicity it is possible to obtain fairly general conditions for consistency,
as we now discuss. By familiar results for exponential families,

(3.2)    (a) $r_n = -x_n \Delta\gamma_n + (b_n - b_n^0)$,
         (b) $i_n = E(r_n|c_n) = b_n - b_n^0 - b'(\gamma_n^0) \Delta\gamma_n$,
         (c) $j_n = \mathrm{Var}(r_n|c_n) = b''(\gamma_n^0) (\Delta\gamma_n)^2$,
         (d) $m_n = -(x_n - b'(\gamma_n^0)) \Delta\gamma_n$,

where $\Delta\gamma_n = \gamma_n - \gamma_n^0$ and the superscript $^0$ denotes evaluation at the true value.
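
The identities in (3.2) can be verified by direct enumeration in the Bernoulli case. The sketch below assumes $b(\gamma) = \ln(1 + e^\gamma)$ and treats $\gamma_n$ and $\gamma_n^0$ as given numbers (the values used are hypothetical); it computes the conditional mean and variance of $r_n$ under the true value and compares them with formulas (b) and (c).

```python
import math

def b(gamma):
    """Bernoulli cumulant function b(gamma) = ln(1 + e^gamma)."""
    return math.log1p(math.exp(gamma))

def b1(gamma):
    """b'(gamma): the conditional mean of x."""
    return 1.0 / (1.0 + math.exp(-gamma))

def b2(gamma):
    """b''(gamma): the conditional variance of x."""
    p = b1(gamma)
    return p * (1.0 - p)

def r(x, gamma, gamma0):
    """(3.2a): r = -x * (gamma - gamma0) + (b(gamma) - b(gamma0))."""
    return -x * (gamma - gamma0) + (b(gamma) - b(gamma0))

def moments_of_r(gamma, gamma0):
    """Mean and variance of r when x is Bernoulli with natural parameter gamma0."""
    p0 = b1(gamma0)
    probs = {0: 1.0 - p0, 1: p0}
    mean = sum(pr * r(x, gamma, gamma0) for x, pr in probs.items())
    var = sum(pr * (r(x, gamma, gamma0) - mean) ** 2 for x, pr in probs.items())
    return mean, var

gamma0, gamma = 0.3, -0.8          # hypothetical true and alternative values
delta = gamma - gamma0
mean_r, var_r = moments_of_r(gamma, gamma0)
i_formula = b(gamma) - b(gamma0) - b1(gamma0) * delta   # (3.2b)
j_formula = b2(gamma0) * delta ** 2                     # (3.2c)
```

The conditional mean $i_n$ is a Kullback-Leibler divergence, so it is positive whenever $\gamma_n \ne \gamma_n^0$.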


TABLE 3.1

Conditional                                                                        Region of ergodicity
distribution   $b(\gamma)$           $b''(\gamma)$                  $\gamma(\eta)$             ($p = 2$ case)

1. Normal      $\gamma^2/2$          $1$                            $\eta$                     $\theta_0 = 0$; $1 - \theta_1 B - \theta_2 B^2$
                                                                                   must have roots
                                                                                   outside unit circle
2. Bernoulli   $\ln(1 + e^\gamma)$   $e^\gamma/(1 + e^\gamma)^2$    [illegible in source]      a bounded polygon
                                                                                   (see Appendix A.1)
3. Poisson     $e^\gamma$            $e^\gamma$                     $\ln(1 - e^{-\eta})$       $\theta_0, \theta_1, \theta_2 > 0$ is sufficient





We assume the following smoothness condition on $\gamma$ and $b$:

(3.3)    $\gamma'(\cdot)$ and $b''(\gamma(\cdot))$ are uniformly bounded away from 0 and $\infty$
         [the range over which this is required is illegible in the source].

Under this condition the asymptotic behavior of $I_N$ and $J_N$ depends only on
that of $\sum (\Delta\eta_n)^2$. To satisfy conditions (2.7)-(2.8) in the consistency theorem
(Theorem 2C), it suffices to find constants $a_N \to \infty$ such that, for $\theta \ne \theta^0$,

(3.4)    $a_N^{-2} \sum_1^N (\Delta\eta_n)^2 \to_P 0$ but $a_N^{-1} \sum_1^N (\Delta\eta_n)^2$ is locally uniformly bounded
         away from zero.


If we define $Y_n = (x_{n-p}, \ldots, x_{n-1})$, then $\{Y_n\}_{n=1,2,\ldots}$ is clearly a Markov
process with stationary transition function. The process $\{x_n\}$ is said to be ergodic
if $\{Y_n\}$ is indecomposable and admits a strictly positive probability density
invariant under the transition function.


LEMMA 3A. Suppose that the generalized autoregressive process $\{x_n\}$ is
ergodic and that $E^* x_{n-i} x_{n-j}$ exist for $1 \le i \le j \le p$, where $E^*$ denotes
expectation with respect to the invariant distribution. Then condition (3.4) holds for any
initial probability density.


PROOF. (i) First we show that a law of large numbers applies: i.e., for any
measurable function $g(Y) = g(x_1, \ldots, x_p)$ such that $E^*|g(Y)|$ exists, we have
$P(A) = 1$ where $A = \{N^{-1} \sum_1^N g(Y_n) \to E^* g(Y)\}$. This follows from Birkhoff’s
ergodic theorem by the following amusing argument (pointed out to the author
by R. R. Bahadur). Define $G(y_1) = P(A|Y_1 = y_1)$. From the Markov property
$G(y_1)$ does not depend on the initial density $p_0$, and $P(A) = \int G(y_1) p_0(y_1) \, dy_1$.
Now if $p_0 = p^*$, the invariant density, then $\{Y_n\}_{n=1,2,\ldots}$ is ergodic as a strictly
stationary process. Birkhoff’s theorem then implies that $\int G(y_1) p^*(y_1) \, dy_1 = 1$.
Since $p^*$ is strictly positive, this equation can be true only if $G(y_1) = 1$ a.e.
Hence $P(A) = \int G(y_1) p_0(y_1) \, dy_1 = 1$ for any initial density $p_0$.

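(ii) Now we turn to the main proof. Define $a(\theta) = E^*(\Delta\eta_n)^2$, $a(\theta, \rho) =
E^*[\inf_{\|\theta' - \theta\| < \rho} (\Delta\tilde\eta_n)^2]$, where $\Delta\eta_n = \eta_n(\theta) - \eta_n(\theta^0) = (\theta_0 - \theta_0^0) + \sum_{i=1}^{p} (\theta_i - \theta_i^0) x_{n-i}$
and $\Delta\tilde\eta_n = \eta_n(\theta') - \eta_n(\theta^0)$. These expectations are independent of $n$ since
they are taken with respect to the invariant distribution. It is easy to see that by
the monotone convergence theorem, $a(\theta, \rho) \to a(\theta)$ as $\rho \to 0$, and in part (iii) of
this proof we will show that $a(\theta) > 0$ for all $\theta \ne \theta^0$. Hence there exists an $\varepsilon > 0$
such that $a(\theta, \varepsilon) > 0$. The law of large numbers in (i) then gives
$N^{-1} \sum_1^N [\inf_{\|\theta' - \theta\| < \varepsilon} (\Delta\tilde\eta_n)^2] \to a(\theta, \varepsilon) > 0$ a.e., and $N^{-1} \sum_1^N (\Delta\eta_n)^2 \to a(\theta)$ a.e.
Condition (3.4), with $a_N = N$, follows immediately.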

(iii) It remains to show that $a(\theta) = E^*(\Delta\eta_n)^2$ exists and is strictly positive.
To see this, write $\delta = \theta_0 - \theta_0^0$, $d = (\theta_1 - \theta_1^0, \ldots, \theta_p - \theta_p^0)'$, $\mu = E^*(Y)$, and $\Sigma =
\mathrm{cov}^*(Y)$. Then direct calculation gives $a(\theta) = (\delta + d'\mu)^2 + d'\Sigma d$. Since the
strict positivity of the invariant density implies that $\Sigma$ is strictly positive
definite, it is easy to see that $a(\theta) > 0$ unless $\theta = \theta^0$. □
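
The limit in Lemma 3A can be illustrated by simulation. The following is a minimal sketch, assuming the normal case of Table 3.1 with $p = 1$, unit innovation variance, and an intercept; the parameter values, the seed, and the function names are hypothetical. It compares the running average of $(\Delta\eta_n)^2$ along one simulated path with the closed form $(\delta + d'\mu)^2 + d'\Sigma d$.

```python
import random

def average_squared_delta_eta(theta_true, theta_alt, n_steps, seed=0):
    """Average of (Delta eta_n)^2 along one simulated normal AR(1) path."""
    rng = random.Random(seed)
    c0, phi0 = theta_true            # true intercept and lag coefficient
    c1, phi1 = theta_alt             # alternative parameter value
    x_prev = c0 / (1.0 - phi0)       # start at the stationary mean
    total = 0.0
    for _ in range(n_steps):
        delta_eta = (c1 - c0) + (phi1 - phi0) * x_prev
        total += delta_eta ** 2
        x_prev = c0 + phi0 * x_prev + rng.gauss(0.0, 1.0)
    return total / n_steps

def limit_a(theta_true, theta_alt):
    """a(theta) = (delta + d*mu)^2 + d*Sigma*d under the invariant law."""
    c0, phi0 = theta_true
    c1, phi1 = theta_alt
    mu = c0 / (1.0 - phi0)               # stationary mean of x
    sigma = 1.0 / (1.0 - phi0 ** 2)      # stationary variance of x
    delta, d = c1 - c0, phi1 - phi0
    return (delta + d * mu) ** 2 + d * sigma * d

theta_true = (1.0, 0.5)   # hypothetical true value
theta_alt = (0.7, 0.8)    # hypothetical alternative
simulated = average_squared_delta_eta(theta_true, theta_alt, 200000)
exact = limit_a(theta_true, theta_alt)
```

With these values the closed form equals $0.09 + 0.12 = 0.21$, and the simulated average settles close to it, in line with taking $a_N = N$ in (3.4).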


By the same arguments used in part (ii) of the above proof, it is easy to see
that condition (2.9a) is also satisfied under ergodicity. Thus if the parameter
space is taken to be any compact subset in the ergodicity region, Theorem 2C can
be applied to yield consistency of $\hat\theta$.

When the state space of {Y,,} is decomposable into several ergodic classes, the 
above theory can still be applied to each ergodic class. Another type of non- 
ergodicity is much more difficult to handle, namely, when the process exhibits no 
steady state behavior, such as the nonstationary normal AR process investigated 
in the next section. 


3.2. Nonstationary normal autoregressive process. We now return to the
first model of Table 3.1, i.e., the conditional distribution of $x_n$ given previous
values is normal with mean $\eta_n = \theta_1 x_{n-1} + \theta_2 x_{n-2} + \cdots + \theta_p x_{n-p}$ and variance
1, but we no longer require $\theta$ to lie in the region of ergodicity.


LEMMA 3B. For the normal AR process, ergodic or otherwise, condition
(3.4) always holds.


PROOF. It is easy to check that if $\{x_n\}$ is an AR process then $\Delta\eta_n$ must be an
ARMA process with the same autoregressive polynomial. Consider the unique
factorization of this AR polynomial

    $(1 - \theta_1^0 B - \cdots - \theta_p^0 B^p) = \prod_{j=1}^{J} (1 - \lambda_j B)^{m_j}$.

The asymptotic behavior of the ARMA process depends on the positions of
the $\lambda_j$’s and their multiplicities. To simplify notation, write $\lambda_j$ in polar form,
i.e., $\lambda_j = \rho_j e^{i\omega_j}$ (here $i = \sqrt{-1}$), and order the $\lambda_j$’s so that $\rho_1 = \rho_2 = \cdots
= \rho_h > \rho_{h+1} \ge \cdots \ge \rho_J$ and $m_1 \ge m_2 \ge \cdots \ge m_h$. Let $\rho = \rho_1 = \max_{j \le J} \rho_j$
and $m = \max_{j \le h} m_j$, and consider three cases:

(i) $\rho < 1$: in this case the process is ergodic and there is no difficulty.

(ii) $\rho > 1$: this is the so-called “explosive” case, in which the variance increases
exponentially. The problem is to determine the exact rate of increase. By rather
elaborate calculation, it can be shown that

    $\mathrm{Var}(\rho^{-N} N^{-(m-1)} \Delta\eta_N) \sim \sum_{i=1}^{h} \sum_{j=1}^{h} r_{ij} \cos(\nu_{ij} + (\omega_i - \omega_j) N)$,

where the amplitudes $r_{ij}$ and the phases $\nu_{ij}$ are continuous functions of $\theta$ and $\theta^0$.
From this it follows that (3.4) holds with a suitable $a_N$ [the explicit choice, an
expression in $\rho^N$ and $N$, is illegible in the source].

(iii) $\rho = 1$: this is the nonexplosive nonstationary case. It can be shown that
$\mathrm{Var}(N^{-(m-1/2)} \Delta\eta_N) \to c$, where $c$ depends continuously on $\theta$ and $\theta^0$. Hence (3.4)
holds with $a_N$ a power of $N^{m-1/2}$ [the exponent is illegible in the source]. The most
complete result on this case can be found in Tiao and Tsay (1983). □
For normal AR processes, the tightness condition (2.9) is trivial to verify, since
the model has a natural exponential family structure. It is also easy to see that the
log-likelihood is strictly concave. Thus by Theorem 2E, $\hat\theta$ is consistent for $\theta^0$,
without requiring any ergodicity or compactness condition.
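
The conclusion can be illustrated in the simplest case. The sketch below assumes $p = 1$ and unit innovation variance, so that the conditional maximum likelihood estimate coincides with least squares; the sample sizes, seed, and function name are hypothetical. It evaluates the estimate on one simulated path in the ergodic, unit-root, and explosive regimes.

```python
import random

def ar1_mle(theta_true, n_steps, seed=0):
    """Conditional MLE (least squares) of theta in x_n = theta * x_{n-1} + e_n."""
    rng = random.Random(seed)
    x_prev = 0.0
    sum_xy = 0.0   # running sum of x_n * x_{n-1}
    sum_xx = 0.0   # running sum of x_{n-1}^2
    for _ in range(n_steps):
        x = theta_true * x_prev + rng.gauss(0.0, 1.0)
        sum_xy += x * x_prev
        sum_xx += x_prev * x_prev
        x_prev = x
    return sum_xy / sum_xx

ergodic = ar1_mle(0.5, 20000)     # rho < 1
unit_root = ar1_mle(1.0, 5000)    # rho = 1
explosive = ar1_mle(1.05, 400)    # rho > 1
```

The estimate is close to the true value in all three regimes; the error shrinks faster in the nonstationary cases, which reflects the faster growth of $\sum (\Delta\eta_n)^2$ there.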


3.3. Proportional hazard models. This is the model which led Cox (1972, 1975)
to formulate the general idea of partial likelihood. It is thus of interest to
examine it in light of the preceding discussions. In the Cox model one observes
failure times of a group of individuals subjected to censoring. Suppose the
(uncensored) failures occur at distinct times $t_{(1)} < \cdots < t_{(N)}$. Let $R_n$ be the risk
set at time $t_{(n)}$, i.e., the set of individuals who have not failed or been censored by
that time. Furthermore, suppose that for each individual one also observes a set
of explanatory variables $z = (z_1, \ldots, z_p)$. The distinctive assumption of the
proportional hazard model is that the hazard function for an individual at risk at
age $t$ is

(3.5)    $\lambda(t, z) = \lambda_0(t) w(\theta, z)$,

where $\lambda_0$ is the baseline hazard measuring the hazard at $\theta = 0$, and $w(\theta, z)$ is a
weighting function (or relative risk factor). The interest is usually in the
estimation of the “regression coefficient” $\theta = (\theta_1, \ldots, \theta_p)$, which characterizes how
the explanatory variable $z$ affects the hazard, with $\theta = 0$ corresponding to
the case of no effect. If $w(\theta, z)$ depends only on the inner product $\theta \cdot z$, i.e.,
$w(\theta, z) = w(\theta \cdot z)$, then the model can be called a Cox linear regression model. If,
further, $w(\theta \cdot z) = e^{\theta \cdot z}$, then we have the natural Cox model.

To obtain a partial likelihood for $\theta$, let $x_n$ specify the covariate value
associated with the individual who fails at $t_{(n)}$, and let $c_n$ denote all death and
censoring times up to and including time $t_{(n)}$. If $p_n(z)$ is defined to be the
fraction of individuals in $R_n$ having covariate value equal to $z$ and $Z_n$ = the set
of covariate values of individuals in $R_n$, then the conditional likelihood of $x_n$
given $c_n$ is

(3.6)    $f_\theta(x_n|c_n) = \dfrac{p_n(x_n) w(\theta, x_n)}{\sum_{z \in Z_n} p_n(z) w(\theta, z)}$.

The partial likelihood based on $x_n|c_n$, $n = 1, \ldots, N$, is just the product of these
conditional likelihoods.

In this paper we will only study the important though special case when the
explanatory variable $z$ is discrete, i.e., $z \in Z = \{z^{(1)}, \ldots, z^{(K)}\}$; note that each $z^{(k)}$
in $Z$ is a $p$-vector. Under this assumption the conditional likelihood of $x_n$ given
$c_n$ depends on $c_n$ only through $p_n$, i.e., $f_\theta(x_n|c_n) = f(x_n; p_n, \theta)$. Similarly, there are
well-defined functions $r$, $i$, $j$, and $m$, such that

    $r_n(\theta) = r(x_n; p_n, \theta)$,    $m_n(\theta) = m(x_n; p_n, \theta)$,
    $i_n(\theta) = i(p_n, \theta)$,         $j_n(\theta) = j(p_n, \theta)$.

Provided that $w(\theta, z) > 0$ is continuous in $\theta \in \Theta$ for each $z \in Z$, the functions $r$,
$m$, $i$, and $j$ are each continuous in its domain. The domain of $r$ or $m$ is
$Z \times S \times \Theta$, and the domain of $i$ or $j$ is $S \times \Theta$, where $S$ is the simplex
$S = \{p \in R^K: p_k \ge 0, \sum_{k=1}^{K} p_k = 1\}$.
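
As an illustration of the conditional likelihood (3.6), here is a minimal sketch for the natural Cox model with a scalar covariate, $w(\theta, z) = e^{\theta z}$. The toy risk-set fractions and failure covariates are hypothetical (ten individuals, five in each stratum, no censoring), as are the function names.

```python
import math

def conditional_likelihood(p_n, x_n, theta):
    """(3.6) with w(theta, z) = exp(theta * z).

    p_n maps each covariate value present in the risk set to the fraction
    of the risk set having that value; x_n is the covariate of the failure.
    """
    denom = sum(frac * math.exp(theta * z) for z, frac in p_n.items())
    return p_n[x_n] * math.exp(theta * x_n) / denom

def log_partial_likelihood(risk_fractions, failures, theta):
    """Sum of the log conditional likelihoods over the observed failures."""
    return sum(
        math.log(conditional_likelihood(p_n, x_n, theta))
        for p_n, x_n in zip(risk_fractions, failures)
    )

# Hypothetical data: risk-set composition just before each of three failures.
risk_fractions = [
    {0: 5 / 10, 1: 5 / 10},
    {0: 5 / 9, 1: 4 / 9},
    {0: 5 / 8, 1: 3 / 8},
]
failures = [1, 1, 0]   # covariate value of the individual failing each time

value_at_zero = conditional_likelihood(risk_fractions[0], 1, 0.0)
total_mass = sum(conditional_likelihood(risk_fractions[0], z, 0.7) for z in (0, 1))
```

At $\theta = 0$ the conditional likelihood reduces to $p_n(x_n)$, and for any $\theta$ it sums to one over the covariate values in the risk set, as a conditional density should.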

It is now clear that the asymptotic behavior of $I_N$ and $J_N$ depends on that of
$\{p_n, n = 1, \ldots, N\}$. To develop the asymptotic theory, consider a sequence of
experiments where $N$, the number of deaths in the experiment, increases without
bound. In general, for any fixed $n$, the conditioning variables $c_n^{(N)}$ and $c_n^{(N')}$ will
be different if $N < N'$ due to the fact that in the larger experiment the risk set
$R_n^{(N')}$ contains more individuals. To avoid confusion, the superscript $(N)$ will be
used if necessary.

Let $h^{(N)}(t)$ be the vector of relative proportions of individuals in each
covariate stratum who are still at risk at relative time $t$, where time is scaled by
mortality experience, i.e., the $k$th component of $h^{(N)}$ is defined by $h_k^{(N)}(t) =
p_n^{(N)}(z^{(k)})$ if $t = n/N$ and linear in between. Then

    $N^{-1} I_N(\theta) = N^{-1} \sum_{n=1}^{N} i(h^{(N)}(n/N), \theta) \approx \int_0^1 i(h^{(N)}(t), \theta) \, dt$,

    $N^{-1} J_N(\theta) = N^{-1} \sum_{n=1}^{N} j(h^{(N)}(n/N), \theta) \approx \int_0^1 j(h^{(N)}(t), \theta) \, dt$.
In Appendix A.2 it will be shown that under the regularity conditions stated,
there exists a (nonrandom) differentiable function $h$ to which $h^{(N)}$ converges
weakly in $C[0, A]^K$ for any $0 < A < 1$. It then follows that

    $N^{-1} I_N(\theta) \to_P \int_0^1 i(h(t), \theta) \, dt$,
    $N^{-1} J_N(\theta) \to_P \int_0^1 j(h(t), \theta) \, dt$.


The function $h$ is determined by a system of ordinary differential equations
which can be solved numerically. The use of these differential equations will be
further discussed in Section 4. Similarly,

    $N^{-1} \sum_{n=1}^{N} \inf_{\|\theta' - \theta\| < \rho} i(h^{(N)}(n/N), \theta') \to_P \int_0^1 \inf_{\|\theta' - \theta\| < \rho} i(h(t), \theta') \, dt$.

If $i(h(t_0), \theta_0) > 0$ for a given $(t_0, \theta_0)$, then by continuity $i(h(t), \theta) > 0$ for all $(t, \theta)$
near $(t_0, \theta_0)$, i.e., there exists $\rho$ small enough such that $\inf_{\|\theta' - \theta_0\| < \rho} i(h(t), \theta') > 0$ for all $t$
near $t_0$. The above limit is thus strictly positive.

In the case when the parameter space is compact, the above results imply that
conditions (2.7)-(2.8) are satisfied with $a_N = N$. The tightness condition (2.9) is
also easy to verify using similar arguments. The consistency of the partial
likelihood MLE then follows from Theorem 2C.

For the natural Cox model, $w(\theta, z) = e^{\theta \cdot z}$ and (3.6) becomes a natural
exponential family. If the distribution in the original population with respect to
covariate stratum is not degenerate, then Theorem 2E implies the consistency of
$\hat\theta$, without requiring compactness of the parameter space.
4. Asymptotic normality. For obvious reasons, it is convenient to assume
$\Theta \subset R^p$. Assuming that the following derivatives exist almost everywhere, we
will write

    $l_n(\theta) = \log f_\theta(x_n|c_n)$;    $L_N = \sum_1^N l_n$,

    $\dot l_n(\theta) = D l_n(\theta)$,    $\ddot l_n(\theta) = D^2 l_n(\theta)$,    $\dddot l_n(\theta) = D^3 l_n(\theta)$,

where $D^i$ is the $i$th order differential operator. For example, $D^3 l_n(\theta_0)$ is the triple
array of third-order derivatives $\{D_{ijk} l_n(\theta)|_{\theta = \theta_0}: i, j, k = 1, \ldots, p\}$ and for any $p$-vector
$e$, $D^3 l_n(\theta) \cdot (e)^3 = \sum_{i,j,k} (D_{ijk} l_n(\theta)) e_i e_j e_k$.

Let $u_n = \dot l_n(\theta_0)$ be the conditional score for the experiment $x_n|c_n$; then under
standard conditions for the conditional densities, we have

(4.1)    $E(u_n|c_n) = 0$,    $v_n = \mathrm{Cov}(u_n|c_n) = E(-\ddot l_n(\theta_0)|c_n)$.

With $U_N = \sum_1^N u_n$, $V_N = \sum_1^N v_n$, the main asymptotic normality result is


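THEOREM 4A. Suppose $\hat\theta$ is consistent for $\theta_0 \in$ interior of $\Theta \subset R^p$, and for
each $n$, $l_n$ has third-order derivatives almost surely and (4.1) holds. Assume also
that there are constants $a_N \uparrow \infty$ and a neighborhood $O$ of $\theta_0$ such that

(4.2)    $a_N^{-1} V_N \to_P$ some p.d. matrix $Q$,

(4.3)    $a_N^{-1} (-\ddot L_N(\theta_0)) \to_P$ some p.d. matrix $Q_1$,

(4.4)    $P(a_N^{-1} \sup_{\theta \in O} |\dddot L_N(\theta)| < M) \to 1$ for some constant $M$,

(4.5)    $a_N^{-2} \sum_1^N E(|u_n|^4 | c_n) \to_P 0$.

Then

    $a_N^{1/2} (\hat\theta_N - \theta_0) \to_d N(0, Q_1^{-1} Q Q_1^{-1})$.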

REMARK. In many cases, if (4.2) holds then (4.3) also holds with $Q_1 = Q$; for
example, a sufficient condition for this is

(4.6)    $a_N^{-2} \sum_1^N \mathrm{Var}(e' \ddot l_n(\theta_0) e | c_n) \to_P 0$ for all unit vectors $e$.


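PROOF OF THEOREM. By definition of $\hat\theta$ and Taylor expansion,

(4.7)    $0 = \dot L_N(\hat\theta) = U_N + \ddot L_N(\theta_0) \cdot (\hat\theta - \theta_0) + \dddot L_N(\theta^*) \cdot (\hat\theta - \theta_0)^2 / 2$,

where $\theta^*$ lies between $\hat\theta$ and $\theta_0$. Let $B_N = -[\ddot L_N(\theta_0) + \tfrac{1}{2} \dddot L_N(\theta^*) \cdot (\hat\theta - \theta_0)]$.
Since $\hat\theta \to_P \theta_0$, by (4.3)-(4.4), $a_N^{-1} B_N$ becomes positive definite with probability
tending to 1. Hence from (4.7) we have

(4.8)    $a_N^{1/2} (\hat\theta - \theta_0) = [a_N^{-1} B_N]^{-1} (a_N^{-1/2} U_N)$.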
First note that $[a_N^{-1} B_N]^{-1} \to_P Q_1^{-1}$, and thus the theorem will be proved if we
can show that $a_N^{-1/2} U_N \to_d N(0, Q)$. For any constant unit vector $e$, define
$t_n = e \cdot u_n$; then by (4.1) $\{t_n\}_{n=1,2,\ldots}$ are martingale differences with respect to
the $\sigma$-fields generated by $\{c_n\}_{n=1,2,\ldots}$ and

    $a_N^{-1} \sum_1^N \mathrm{Var}(t_n|c_n) = a_N^{-1} e' V_N e \to_P e' Q e$.

Furthermore, it follows easily from (4.5) that

    $a_N^{-2} \sum_1^N E(|t_n|^4 | c_n) \to_P 0$,

and hence, for any $\varepsilon > 0$,

    $a_N^{-1} \sum_1^N E[t_n^2 1(|t_n| > \varepsilon a_N^{1/2}) | c_n] \le \varepsilon^{-2} a_N^{-2} \sum_1^N E(|t_n|^4 | c_n) \to_P 0$.

Since all conditions for the martingale central limit theorem (Brown, 1971) are
verified for the martingale $a_N^{-1/2} \sum_1^N t_n$, it follows that

    $e \cdot (a_N^{-1/2} U_N) = a_N^{-1/2} \sum_1^N t_n \to_d N(0, e' Q e)$.

Since this is true for any unit vector $e$, $a_N^{-1/2} U_N \to_d N(0, Q)$. □

Let us illustrate the theory with the proportional hazard model of Section 3.3.
From (3.6), we have

(4.9)    $l_n(\theta) = \log f_\theta(x_n|c_n) = \text{constant} + c(\theta, x_n) - b(\theta, p_n)$,

where

    $c(\theta, x_n) = \log w(\theta, x_n)$ and $b(\theta, p_n) = \log \sum_z p_n(z) e^{c(\theta, z)}$.

We will also assume that

(4.10)    for each $z$, $c(\cdot, z)$ is three times continuously differentiable around $\theta_0$,

(4.11)    $p_1$ is nondegenerate,

(4.12)    $\mathrm{cov}(\dot c(\theta, x_n)|c_n)$ is p.d. whenever $p_n$ is nondegenerate.

Then, using the results of Appendix A.2, conditions (4.2)-(4.6) can be verified,
and hence by Theorem 4A,

    $N^{1/2} (\hat\theta_N - \theta_0) \to_d N(0, Q^{-1})$.


Furthermore, the $(i, j)$th component of the (normalized) Fisher-information
matrix $Q$ based on the partial likelihood can be calculated in the following
manner: with $w^k = w(\theta_0, z^{(k)})$ and $c_{ki}$ = the $i$th partial derivative of $c(\theta, z^{(k)})$
w.r.t. $\theta$ at $\theta = \theta_0$,

(4.13)    $Q_{ij} = \int_0^1 v_{ij}(h(t)) \, dt$,

where

(a) $v_{ij}(h) = \sum_k c_{ki} c_{kj} g_k - [\sum_k c_{ki} g_k][\sum_k c_{kj} g_k]$.

(b) The vectors $g$ and $h$ are related by $g_k = h_k w^k / (\sum_l h_l w^l)$.

(c) The vector function $h(t)$ is determined by the system of ordinary
differential equations specified in (A.13) of the appendix, with initial values $h_k(0)$ = the
proportion, at the beginning of the trial, of individuals having explanatory
variate value $z^{(k)}$.

The form of the differential equation is particularly simple when there is no
censoring:

(4.14)    $\frac{d}{dt} h_k(t) = \frac{1}{1 - t} (h_k(t) - g_k(t))$,    $k = 1, \ldots, K$,  $0 < t < 1$.


In the case of the two-sample problem the explanatory variable $z$ is 0 or 1 and
the hazard is $\lambda_0(t)$ or $\lambda_0(t) e^\theta$, depending on the sample to which the individual
belongs. Let $a = e^\theta$, $q$ = fraction of individuals in sample 0; then from
(4.13)-(4.14) the asymptotic Fisher information is

(4.15)    $\lim [N \mathrm{Var}(\hat\theta)]^{-1} = Q = \int_0^1 a(t)(1 - a(t)) \, dt$,

where

    $a(t) = h(t)/(h(t) + a(1 - h(t)))$,

    $\frac{d}{dt} h(t) = \frac{1}{1 - t} (h(t) - a(t))$,    $h(0) = q$.


The differential equation in (4.15) is easy to solve numerically. Values of Q 
calculated from (4.15) for various values of q and a are presented in Table 4.1. 
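
A numerical solution of this kind can be sketched as follows. This is a minimal illustration, not the author's program. It assumes the differential equation reads $dh/dt = (h(t) - a(t))/(1 - t)$ with $h(0) = q$, and it removes the singularity at $t = 1$ by the change of variable $s = -\log(1 - t)$, which is a choice made here; in the new variable $dh/ds = h - a$ and $Q = \int_0^\infty a(1 - a) e^{-s} \, ds$. The step size, the truncation point, and the function name are likewise assumptions.

```python
import math

def asymptotic_information(q, a, s_max=40.0, n_steps=40000):
    """Approximate Q = int_0^1 a(t)(1 - a(t)) dt for the two-sample problem.

    Uses classical fourth-order Runge-Kutta in s = -log(1 - t), integrating
    h and the running value of Q together.
    """
    def g(h):
        # a(t) in the paper's notation: chance that a failure is from sample 0
        return h / (h + a * (1.0 - h))

    def deriv(s, h):
        gh = g(h)
        return h - gh, gh * (1.0 - gh) * math.exp(-s)

    ds = s_max / n_steps
    s, h, big_q = 0.0, q, 0.0
    for _ in range(n_steps):
        k1h, k1q = deriv(s, h)
        k2h, k2q = deriv(s + ds / 2, h + ds / 2 * k1h)
        k3h, k3q = deriv(s + ds / 2, h + ds / 2 * k2h)
        k4h, k4q = deriv(s + ds, h + ds * k3h)
        h += ds / 6 * (k1h + 2 * k2h + 2 * k3h + k4h)
        big_q += ds / 6 * (k1q + 2 * k2q + 2 * k3q + k4q)
        s += ds
    return big_q

q_half_2 = asymptotic_information(0.5, 2.0)
q_half_5 = asymptotic_information(0.5, 5.0)
q_quarter_5 = asymptotic_information(0.25, 5.0)
```

When $a = 1$ the covariate has no effect, $h$ stays at $q$, and the integral reduces to $q(1 - q)$, which gives a quick sanity check on any implementation.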

The two-sample problem is also studied in Efron (1977) and Oakes (1977).
Efron has derived a formula for the asymptotic Fisher information for the special
case when the two samples have different exponentially distributed lifetimes, i.e.,
the baseline hazard $\lambda_0(t)$ is a constant function. This is formula (4.9) in his
paper, reproduced here as

(4.16)    (Efron formula): $[N \mathrm{Var}(\hat\theta)]^{-1} =$ [right-hand side illegible in the source; an
          integral in $u$ whose integrand contains the factor $(1 - q)$].


TABLE 4.1
Values of asymptotic information Q

                    Q calculated using
  q       a         (4.15)        (4.16)

  0.5     2         0.225345      0.225345
  0.5     5         0.159402      0.159402
  0.25    5         0.114305      0.114305

It is a byproduct of our derivation of (4.13)-(4.14) that in the uncensored case
the asymptotic Fisher information $Q$ does not depend on the baseline hazard
$\lambda_0(t)$. Therefore, Efron’s formula, although derived under the constant $\lambda_0(\cdot)$
case, is actually applicable to the wider case of arbitrary $\lambda_0(\cdot)$, and must
therefore agree with formula (4.15). The values of $Q$ calculated using (4.16) are
also presented in Table 4.1. There is little doubt that the two formulae are
equivalent, although we have not found a direct analytical deduction of one from
the other.


5. Efficiency. In this paper we consider relative asymptotic efficiency as the 
ratio of asymptotic variances of estimators [see e.g., Le Cam (1953) and Bahadur 
(1964)]. Fisher-information is to be interpreted only through its relationship with 
asymptotic variances. Other interpretations exist and may be important, but are 
not discussed here. 

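Let $l_n^{(c)}(\phi)$ and $l_n^{(p)}(\theta) = \log f_\theta(x_n|c_n)$ denote the logarithms of the two
conditional density factors in (1.3); then by (1.3) the logarithm
of the full likelihood can be decomposed as

(5.1)    $L_N^{(f)}(\phi) = \sum_1^N l_n^{(c)}(\phi) + \sum_1^N l_n^{(p)}(\theta) = L_N^{(c)}(\phi) + L_N^{(p)}(\theta)$,

the second term being the logarithm of the partial likelihood.
There are three basic situations to consider.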


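5.1. Finite dimensional nuisance parameter (i.e., $\eta \in R^r$). Suppose that each
of the likelihoods $L_N^{(f)}$ and $L_N^{(p)}$ satisfies the regularity conditions of
Theorem 4A. (Note that for $L_N^{(f)}$, the parameter $\phi = (\theta, \eta)$ is $p + r$ dimensional.)
Denote the MLE of $\phi$ based on $L_N^{(f)}$ by $\hat\phi^{(f)}$ and the MLE of $\theta$ based on $L_N^{(p)}$ by
$\hat\theta^{(p)}$; then

(5.2)    $a_N^{1/2} (\hat\phi^{(f)} - \phi_0) \to_d N(0, S^{-1})$,
         $a_N^{1/2} (\hat\theta^{(p)} - \theta_0) \to_d N(0, Q^{-1})$,

where

    $S = \begin{pmatrix} S_{\theta\theta} & S_{\theta\eta} \\ S_{\eta\theta} & S_{\eta\eta} \end{pmatrix} = \begin{pmatrix} Q + H_{\theta\theta} & H_{\theta\eta} \\ H_{\eta\theta} & H_{\eta\eta} \end{pmatrix}$,

the $p \times p$ matrix $Q$ and the $(p + r) \times (p + r)$ matrix $H$ are defined by

(5.3)    $Q_{ij} = -\operatorname{plim} a_N^{-1} \left[ \frac{\partial^2 L_N^{(p)}}{\partial\theta_i \partial\theta_j} \right]_0 = \operatorname{plim} a_N^{-1} \sum_1^N E\left( \left[ \frac{\partial l_n^{(p)}}{\partial\theta_i} \right]_0 \left[ \frac{\partial l_n^{(p)}}{\partial\theta_j} \right]_0 \Big| c_n \right)$,

         $H_{ij} = -\operatorname{plim} a_N^{-1} \left[ \frac{\partial^2 L_N^{(c)}}{\partial\phi_i \partial\phi_j} \right]_0 = \operatorname{plim} a_N^{-1} \sum_1^N E\left( \left[ \frac{\partial l_n^{(c)}}{\partial\phi_i} \right]_0 \left[ \frac{\partial l_n^{(c)}}{\partial\phi_j} \right]_0 \Big| \mathcal{F}_n \right)$,

and the subscript 0 indicates evaluation at the true value $\phi = \phi_0$.

Thus, marginally, $a_N^{1/2} (\hat\theta^{(f)} - \theta_0) \to_d N(0, S_{\theta\cdot\eta}^{-1})$, where $S_{\theta\cdot\eta} = Q + H_{\theta\cdot\eta}$,
$H_{\theta\cdot\eta} = H_{\theta\theta} - H_{\theta\eta} H_{\eta\eta}^{-1} H_{\eta\theta}$. It is appropriate to call the matrix $S_{\theta\cdot\eta}$ the “marginal
(Fisher) information” for estimating $\theta$ in the presence of $\eta$, and the matrix $H_{\theta\cdot\eta}$
the “loss of information” in using $L^{(p)}$ instead of $L^{(f)}$.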
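
The relation between the marginal information and the inverse of the full information matrix can be checked in the scalar case $p = r = 1$. The sketch below uses hypothetical numerical values for $Q$ and the entries of $H$; the function names are illustrative only.

```python
def marginal_information(q, h_tt, h_te, h_ee):
    """S_{theta.eta} = Q + H_tt - H_te^2 / H_ee for scalar theta and eta."""
    return q + h_tt - h_te ** 2 / h_ee

def theta_variance_from_full_inverse(q, h_tt, h_te, h_ee):
    """(theta, theta) entry of S^{-1} for S = [[Q + H_tt, H_te], [H_te, H_ee]]."""
    s11, s12, s22 = q + h_tt, h_te, h_ee
    det = s11 * s22 - s12 ** 2
    return s22 / det

# Hypothetical information values (assumptions for illustration).
q, h_tt, h_te, h_ee = 2.0, 1.0, 0.5, 1.0
s_marginal = marginal_information(q, h_tt, h_te, h_ee)
var_full = theta_variance_from_full_inverse(q, h_tt, h_te, h_ee)

# Asymptotic relative efficiency of the partial likelihood estimate:
# ratio of the asymptotic variances 1/S_{theta.eta} and 1/Q.
efficiency = q / s_marginal
```

Here the loss of information is $H_{\theta\cdot\eta} = 1 - 0.25 = 0.75$, so the partial likelihood estimate has relative efficiency $2/2.75$.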

Why is it sensible to compare $\hat\theta^{(p)}$ to $\hat\theta^{(f)}$? One justification is the classical
result that $\hat\theta^{(f)}$ achieves the minimal asymptotic variance among all “regular”
estimators. We now describe this result briefly, in preparation for the discussion
of the infinite dimensional nuisance parameter case.

A sequence of parameter values $\{\phi_N\}$ is called a “regular sequence” if
$a_N^{1/2} (\phi_N - \phi_0) \to e$ for some $e \in R^{p+r}$. An estimate $T_N$ (of $\theta$) is called

(i) regular in the Hajek sense if for all regular sequences $\{\phi_N\}$, the distribution
of $a_N^{1/2} (T_N - \theta_N)$ under $\phi_N$ converges to a distribution independent of $\{\phi_N\}$;

(ii) regular in the Bahadur sense if the distribution of $a_N^{1/2} (T_N - \theta_0)$ under $\phi_0$
converges to a normal distribution, and for all regular sequences $\{\phi_N\}$,
$P(T_N \le \theta_N | \phi_N) \to 1/2$.

By a regular estimate we mean an estimate regular in either the Hajek or the Bahadur
sense.


LEMMA 5A. If $L_N^{(f)}$ satisfies the regularity conditions (4.2)-(4.6) of Theorem
4A (with $\phi$ as the parameter and suitable definition of the conditioning fields
$\mathcal{F}_1, \mathcal{F}_2, \ldots$), then any regular estimate $\{T_N\}$ has asymptotic variance larger
than or equal to that of $\hat\theta^{(f)}$.


The Bahadur part of the above result is in Bahadur (1967). The Hajek part 
can be obtained by a modification of Hajek’s original proof (Hajek, 1970) for the 
case with no nuisance parameter. Both authors make use of the LAN condition 
(Le Cam, 1960) which is satisfied under the hypothesis of the lemma. 

With respect to the calculation of $H_{\theta\cdot\eta}$, the following is a natural question at
this point. It is plain that $H_{\theta\cdot\eta}$ is the limiting residual covariance matrix of the
regression of $a_N^{-1/2} D_\theta L_N^{(c)}$ on $a_N^{-1/2} D_\eta L_N^{(c)}$. Now $L_N^{(c)} = \sum_1^N l_n^{(c)}$; if we regress
$D_\theta l_n^{(c)}$ on $D_\eta l_n^{(c)}$ for each $n$ and denote the residual covariance by $h_{\theta\cdot\eta,n}$, will it
be true that $a_N^{-1} \sum_1^N h_{\theta\cdot\eta,n} \to H_{\theta\cdot\eta}$? The answer is no. In general, the limit will
only be a lower bound for $H_{\theta\cdot\eta}$, although that bound is sharp in the i.i.d. case.
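
The gap between the two quantities is easy to see in the scalar case, where the residual variance from regressing one score on the other is $a - b^2/c$ for variance $a$, covariance $b$, and variance $c$. The sketch below uses hypothetical per-step second moments; the common normalization by $a_N$ is omitted since it does not affect the comparison.

```python
def residual_variance(a, b, c):
    """Residual variance a - b^2/c from regressing one scalar score on another."""
    return a - b * b / c

def compare(blocks):
    """Return (sum of per-step residual variances, residual variance of the sums)."""
    sum_of_residuals = sum(residual_variance(*blk) for blk in blocks)
    a, b, c = (sum(col) for col in zip(*blocks))
    return sum_of_residuals, residual_variance(a, b, c)

# Hypothetical non-i.i.d. case: the covariance changes sign between steps.
varying = [(1.0, 0.9, 1.0), (1.0, -0.9, 1.0)]
# Hypothetical i.i.d.-like case: identical second moments at every step.
constant = [(1.0, 0.9, 1.0), (1.0, 0.9, 1.0)]

lower_varying, pooled_varying = compare(varying)
lower_constant, pooled_constant = compare(constant)
```

In the varying case the per-step residuals add up to 0.38 while the pooled residual variance is 2, so the per-step calculation can fall well short of $H_{\theta\cdot\eta}$; with identical steps the two coincide.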


5.2. Infinite dimensional nuisance parameter. In this section we change the
notation slightly: $\lambda$ will denote the infinite dimensional nuisance parameter and
$\eta$ will denote a finite dimensional parameter to be defined below. The nuisance
parameter space $\Gamma$ is assumed to be an infinite dimensional manifold. For
concreteness we take $\Gamma$ to be a submanifold of a Banach space $\mathcal{X}$. We will derive
lower bounds for the variances of regular estimates of $\theta$. These can then be used
to provide upper bounds for the loss of information due to using a partial
likelihood. The definition of a regular estimate is the same as that given in
Section 5.1, but a regular sequence of parameter values $\{\phi_N = (\theta_N, \lambda_N)\}$ is now
defined by the property that $a_N^{1/2} (\phi_N - \phi_0)$ converges in the product topology to
$(e_1, e_2)$ for some $e_1 \in R^p$, $e_2 \in \mathcal{X}$.

To obtain lower variance bounds, consider the true nuisance value $\lambda_0$ as
imbedded in a smooth parametric subfamily $\{\lambda = \lambda(\eta): \eta = (\eta_1, \ldots, \eta_r) \in$ some
neighborhood $O_\eta$ in $R^r\}$. For the parametric problem involving $(\theta, \eta)$, one can
calculate, as in Section 5.1, the lower variance bound $S_{\theta\cdot\eta}^{-1}$ for estimating $\theta$ in the
presence of $(\eta_1, \ldots, \eta_r)$. Intuition suggests that this should also be a lower
variance bound for estimating $\theta$ in the presence of $\lambda$. We now formulate this
more precisely. We will use differential calculus in Banach spaces; see, e.g., Lang
(1972).

By a smooth finite dimensional parameterization of $\lambda$ we mean a differentiable
map $\eta \to \lambda(\eta)$ from some neighborhood in $R^r$ to $\Gamma$. Of special interest below are
the one-dimensional parameterizations $\xi \to \lambda(\xi)$, $\xi \in R^1$. These can be regarded
as curves in $\Gamma$. To each curve is associated a tangent vector based on $\lambda_0$,
$t = \lim_{\xi \to \xi_0} (\xi - \xi_0)^{-1} (\lambda(\xi) - \lambda(\xi_0))$. The set $\mathcal{T}$ of all possible vectors tangent to
$\Gamma$ at $\lambda_0$ constitutes a linear subspace of $\mathcal{X}$, called the tangent space at $\lambda_0$. The
dimension of a parameterization $\eta \to \lambda(\eta)$ is the dimension of the subspace of $\mathcal{T}$
spanned by the partial derivatives of the map. For simplicity we will always
assume that the parameterization is nonsingular, i.e., its dimension is equal to
that of $\eta$. To obtain concrete results, consider the following regularity conditions.


(5.4)    For any smooth curve $\xi \to \lambda(\xi)$, the full likelihood $L_N^{(f)}$ with the
         $p + 1$ dimensional parameter $\phi = (\theta, \xi)$ satisfies conditions
         (4.2)-(4.6).

(5.5)    The elements of the limiting information matrix $Q$ in (4.2) are
         uniformly bounded for all curves with tangent vector lengths $< 1$.

THEOREM 5B. Suppose (5.4) holds and let $S_{\theta\cdot\eta}^{-1}$ be the lower variance bound
for estimating $\theta$ in the $p + r$ dimensional problem with parameter $\phi = (\theta, \eta)$,
where $\eta \to \lambda(\eta)$ is a smooth $r$-dimensional parameterization of $\lambda$. If $\{T_N\}$ is a
regular estimate for $\theta$ in the infinite dimensional problem with parameter
$\phi = (\theta, \lambda)$, then $\{T_N\}$ has asymptotic variance $\ge S_{\theta\cdot\eta}^{-1}$.


PROOF. (i) First we show that $\{T_N\}$ must also be a regular estimate for $\theta$ in
the $p + r$ dimensional problem. To see this, let $\{(\theta_N, \eta_N)\}$ be a regular sequence
of parameter values in the $p + r$ dimensional problem; then by differentiability
of $\eta \to \lambda(\eta)$,

    $a_N^{1/2} (\lambda(\eta_N) - \lambda(\eta_0)) = a_N^{1/2} [(D_\eta \lambda(\eta_0) \cdot (\eta_N - \eta_0)) + o(|\eta_N - \eta_0|)]$.

This converges to a linear combination of the components of the derivative of the
map $\eta \to \lambda(\eta)$, since $a_N^{1/2} (\eta_N - \eta_0)$ converges to a vector in $R^r$. Hence
$\{(\theta_N, \lambda(\eta_N))\}$ is a regular sequence of parameter values in the infinite
dimensional problem. The desired conclusion now follows directly from the definition of
regular estimates.

(ii) It is easy to check that if (4.2)-(4.6) are satisfied for $\phi = (\theta, \xi)$ for all
one-dimensional parameterizations $\xi \to \lambda(\xi)$, then they are also satisfied for
$\phi = (\theta, \eta)$ for any finite dimensional parameterization $\eta \to \lambda(\eta)$. Hence under
(5.4), Lemma 5A can be applied to conclude that $\{T_N\}$ has asymptotic variance
at least $S_{\theta\cdot\eta}^{-1}$. □
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In the case when @ is a scalar parameter, it is an elementary fact, first pointed 
out by Stein (1956), that for any smooth r-dimensional parameterization y > 
A(7), there is a smooth one-dimensional parameterization € > A(£) which gives 
the same lower variance bound as in the r-dimensional case. Thus, to obtain the 
best lower variance bound for regular estimates of 6, it suffices to consider only 
one-dimensional parameterizations of À. A curve £* — A(é*) which yields the 
greatest lower variance bound is called a least favorable curve, the corresponding 
marginal information for 9 is called the “minimal Fisher information” by 
Lindsay (1980, 1983). 

Although easy to define, the minimal Fisher information may be difficult to
compute. For the i.i.d. case, Lindsay (1980) and Begun, Hall, Huang, and Wellner
(1983) give geometric insights as well as examples of computation. In the
non-i.i.d. case, very few results are available. We now investigate the following general
method for computing the minimal Fisher information: consider an increasing
sequence of parameterizations $\{\eta^{(r)} \to \lambda(\eta^{(r)})\}$, $r = 1, 2, \ldots$; here $r$ denotes the
dimension of the parameterization. Clearly, the upper information bound $S_{\theta\cdot\eta}(r)$
calculated using the parameterization $\eta^{(r)} \to \lambda(\eta^{(r)})$ will become smaller as $r$
increases. By choosing the sequence of parameterizations appropriately, we hope
that the limit of these bounds, $\lim_{r\to\infty} S_{\theta\cdot\eta}(r)$, will provide the minimal Fisher
information. For which sequences can the minimal information be calculated by
this method?


THEOREM 5C. If $\theta$ is scalar, (5.4)-(5.5) hold, and a least favorable curve
$\xi^* \to \lambda(\xi^*)$ exists, then a sufficient condition for $S_{\theta\cdot\eta}(r)$ to converge to the
minimal Fisher information as $r \to \infty$ is the following:

(5.6)   For any $\varepsilon > 0$, there exists $r_0 > 0$ such that $r > r_0$ entails that
        $\| t^* - \sum_{i=1}^{r} a_i t_i^{(r)} \| < \varepsilon$ for some $a_1, \ldots, a_r$; here $t^*$ denotes the
        tangent vector of the curve $\xi^* \to \lambda(\xi^*)$, and $t_i^{(r)}$ denotes the $i$th
        partial derivative of the map $\eta^{(r)} \to \lambda(\eta^{(r)})$.


PROOF. To each tangent $t \in \mathcal{T}$, let us associate the numbers $A(t)$
and $B(t)$ as follows: suppose $\xi \to \lambda(\xi)$ is any curve with tangent equal to $t$, then

    $A(t) = \lim a_N \sum_{n=1}^{N} E[\,\cdots\,]$,    $B(t) = \lim a_N \sum_{n=1}^{N} E[\,\cdots\,]$.

Under (5.5), there is a constant $K$ such that

    $|A(t)| \le K\|t\|$,    $|B(t)| \le K\|t\|^2$.

It can also be checked that $A(\cdot)$ is a linear map and $B(\cdot)$ is a quadratic map.
Thus, both $A$ and $B$ are continuous maps.

Now consider the two-dimensional problem with $\phi = (\theta, \xi)$. Here the corresponding
bound for the marginal information of $\theta$ is given by

(5.7)   $S_{\theta\cdot\xi} = S_{\theta\theta} - B(t)^{-1}A(t)^2$.

Note that $S_{\theta\cdot\xi}$ is a continuous function of $t$.
In the $1 + r$ dimensional problem with $\phi = (\theta, \eta^{(r)})$, the corresponding bound
for the marginal information of $\theta$ is given by

(5.8)   $S_{\theta\cdot\eta} = S_{\theta\theta} - S_{\theta\eta}S_{\eta\eta}^{-1}S_{\eta\theta}$.

The bound (5.8) can be shown to be smaller than the bound (5.7) when $t$ in (5.7)
is any linear combination of $t_i^{(r)}$, $i = 1, \ldots, r$. Combining this fact, condition (5.6),
and the continuity of $S_{\theta\cdot\xi}$ in $t$, the result follows immediately. □


REMARK. Typically, the least favorable tangent $t^*$ is difficult to calculate,
and one verifies (5.6) by checking that the span of the partial derivatives $\{t_i^{(r)}$,
$i = 1, \ldots, r\}$ becomes dense in the tangent space $\mathcal{T}$.


We now discuss the case when $\theta$ is a $p$-dimensional parameter. It is no
longer possible to find a one-dimensional parameterization which is as difficult as
a given $r$-dimensional parameterization $\eta \to \lambda(\eta)$. Instead, it is only possible to
find a $p$-dimensional parameterization $\xi \to \lambda(\xi)$ which is as difficult as the
$r$-dimensional parameterization. Thus we must search for the most difficult
$p$-dimensional parameterization $\xi^* \to \lambda(\xi^*)$, which gives the greatest lower bound
(among $p$-dimensional parameterizations) for asymptotic variances of regular
estimators of $\theta$. Geometrically, the map $\xi^* \to \lambda(\xi^*)$ gives rise to a $p$-dimensional
surface in $\Gamma$ which we will call the least favorable surface for the estimation of $\theta$.
Any curve tangent to the least favorable surface at $\lambda = \lambda_0$ is the least favorable
curve for estimating a particular scalar function of $\theta$. The marginal information
for $\theta$ given by the least favorable surface is thus the least upper bound for any
finite parameterization. We will still call this the minimal Fisher information for
the estimation of $\theta$. The general method of calculation outlined above can still be
applied: simply calculate the $p \times p$ matrix $S_{\theta\cdot\eta}$ for each $\eta^{(r)}$ and pass to the
limit. An obvious extension of Theorem 5C then guarantees that the limit is
equal to the minimal Fisher information provided span$\{t_i^{(r)}, i = 1, \ldots, r\}$ becomes
dense in the tangent space as $r$ increases.

Returning finally to the partial likelihood situation, if we have a sequence of
parameterizations to which Theorem 5C applies, the marginal information for
estimating $\theta$ is $Q + H_{\theta\cdot\eta}$, where $Q$ is the same for any parameterization of the
nuisance parameter, and $H_{\theta\cdot\eta}$ decreases as the parametric subfamily is enlarged.
To get an upper bound for the loss of information of the partial likelihood, we
calculate $H_{\theta\cdot\eta}$ from $L_N^{(c)}$ for the subfamilies and pass to the limit. Some
illustrative examples are given in Section 6.

We end this discussion with a remark on a difficulty of the minimal Fisher
information as a criterion for efficiency comparison in the presence of nuisance
parameters. If an estimate achieves the minimal Fisher information bound, then
there can be no other regular estimate with smaller asymptotic variance, and the
estimate can, justifiably, be regarded as efficient. On the other hand, if the
estimate has asymptotic variance larger than the inverse minimal Fisher information,
should it then be regarded as inefficient? This question remains largely
unresolved in the case of infinite dimensional nuisance parameters, since in this
case it is not known whether there is any regular estimate which can achieve the
minimal Fisher information bound. Results on some special cases of this problem
can be found in Pfanzagl (1982) and Bickel (1982).


5.3. Incidental nuisance parameter. There are examples, in the i.i.d. case, 
where “new components” of the nuisance parameter arise as new observations are 
made, in such a way that none of the components of the nuisance parameter can 
be estimated with diminishing error. In such cases the MLE is often inconsistent. 
Neyman and Scott (1948) called such nuisance parameters “incidental parame- 
ters.” 
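
The inconsistency can be seen in the classical paired-observation setting usually associated with Neyman and Scott. The sketch below is an illustration only, with assumptions that are not taken from the text: each pair $(x_1, x_2)$ is $N(\mu_i, \sigma^2)$ with its own mean $\mu_i$ (drawn here from an arbitrary uniform range), and the function name `neyman_scott_mle` is hypothetical. Maximizing the full likelihood over all $\mu_i$ and $\sigma^2$ gives a variance estimate that settles near $\sigma^2/2$ rather than $\sigma^2$.

```python
import random


def neyman_scott_mle(num_pairs, sigma=1.0, seed=0):
    """Full-likelihood MLE of sigma^2 when every pair has its own unknown mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        mu = rng.uniform(-5.0, 5.0)  # incidental parameter, one per pair (assumed range)
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        # The MLE of mu is the pair mean, so the residual sum of squares
        # for this pair is (x1 - x2)^2 / 2.
        total += (x1 - x2) ** 2 / 2.0
    # 2 * num_pairs observations in total.
    return total / (2 * num_pairs)


sigma2_hat = neyman_scott_mle(20000)
print(sigma2_hat)  # stays near sigma^2 / 2 however many pairs are added
```

Adding more pairs adds more incidental parameters at the same rate, so none of the $\mu_i$ is ever estimated with diminishing error and the bias in the variance estimate does not vanish.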

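Similar phenomena of inconsistency also occur in partial likelihood situations.
Specifically, if $\eta_n$ denotes the incidental parameter which appears only in the
conditional likelihood $f(w_n \mid d_n)$, then the full log-likelihood is

    $L_N^{(f)}(\theta, \eta_1, \ldots, \eta_N) = L_N^{(p)}(\theta) + \sum_{n=1}^{N} \log f_{\theta,\eta_n}(w_n \mid d_n)$,

and the likelihood equations are

(5.9)   $\frac{\partial L_N^{(p)}}{\partial\theta} + \sum_{n=1}^{N} \frac{\partial l_n^{(c)}}{\partial\theta} = 0$,    $\frac{\partial l_n^{(c)}}{\partial\eta_n} = 0$,  $n = 1, \ldots, N$.

Given any $\theta$, the value for $\eta_n$ can be obtained from the second equation, giving
$\hat\eta_n = \hat\eta_n(\theta)$, a random variable whose distribution is unaffected by the collection
of further data $w_{n+1}, x_{n+1}, w_{n+2}, x_{n+2}, \ldots$. Substituting $\hat\eta_n(\theta)$ back in the first
equation, we obtain the equation for $\hat\theta$:

    $\frac{\partial L_N^{(p)}}{\partial\theta}(\theta) + \sum_{n=1}^{N} g_n(\theta) = 0$,  where  $g_n(\theta) = \frac{\partial l_n^{(c)}}{\partial\theta}(\theta, \hat\eta_n(\theta))$.

Now, although $E[(\partial l_n^{(c)}/\partial\theta)_0 \mid d_n] = 0$, because of the distribution of $\hat\eta_n(\theta)$, it is
generally true that $E(g_n(\theta_0) \mid d_n) \ne 0$, and hence the equation for $\hat\theta$ would lead to
inconsistent estimates.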

On the other hand, if the conditions of Theorem 4A are satisfied, the use of
only the partial likelihood $L_N^{(p)}(\theta)$ will of course produce consistent and
asymptotically normal estimates. Godambe (1976), Andersen (1973), and Lindsay
(1980, 1982) have given some conditions, based on extensions of the concepts of
sufficiency and ancillarity, under which conditional likelihoods are fully
informative.


6. Examples of efficiency calculation.


6.1. Missing values in AR processes. As a first example, suppose we observe
$J$ disconnected sequences of a time series: $[z_{n_1}, \ldots, z_{m_1}]$, $[z_{n_2}, \ldots, z_{m_2}]$,
$\ldots$, $[z_{n_J}, \ldots, z_{m_J}]$, where $n_1 < m_1 < n_2 < m_2 < \cdots < n_J < m_J = N$. Suppose
that within the segments the series follows an AR(1) model, i.e., $z_t = \theta z_{t-1} + a_t$
provided $z_t$ and $z_{t-1}$ are in the same segment, where $-1 < \theta < 1$, and the $a_t$'s
are i.i.d. $N(0,1)$. A partial likelihood can be set up based on the conditional
densities $f_\theta(z_t \mid z_{t-1})$ of those $z_t$ whose predecessor $z_{t-1}$ is also observed. Then

    $L_N^{(p)}(\theta) = -\frac{1}{2}\sum_{j=1}^{J}\sum_{t=n_j+1}^{m_j}(z_t - \theta z_{t-1})^2$,

    $U_N = \sum_{j=1}^{J}\sum_{t=n_j+1}^{m_j} a_t z_{t-1}$  and  $V_N = \sum_{j=1}^{J}\sum_{t=n_j+1}^{m_j} z_{t-1}^2$.

The conditions for consistency and asymptotic normality of the MLE $\hat\theta^{(p)}$ based
on the partial likelihood are easy to verify if $N^{-1}V_N$ converges to some positive
constant. For simplicity consider the regular case when the length of each
observed segment is $m_j - n_j = k$, and the length of each missing segment is
$n_j - m_{j-1} = l$. We now discuss three different models for the missing values,
leading to different comparisons of the partial likelihood MLE $\hat\theta^{(p)}$ to the full
MLE $\hat\theta^{(f)}$.

(i) The whole series $z_1, \ldots, z_N$ follows an AR(1) model: in this case clearly
$N^{-1}V_N \to_p (k/(k + l))(1/(1 - \theta^2))$ as $N \to \infty$; on the other hand, the full
information in the complete data $z_1, \ldots, z_N$ is $1/(1 - \theta^2)$. Hence the asymptotic
efficiency of $\hat\theta^{(p)}$ is bounded from below by $k/(k + l)$; this bound is close to 1 if $k$
is much larger than $l$.
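
A small simulation can make case (i) concrete. This is a sketch under assumptions: the function name `simulate_partial_mle` and the chosen values of $\theta$, $k$, $l$ are illustrative, the series is started from its stationary distribution, and each block consists of $k$ observed transitions followed by $l$ steps across a gap. It returns the partial likelihood estimate $\sum z_t z_{t-1} / \sum z_{t-1}^2$ over observed pairs and the normalized quantity $N^{-1}V_N$.

```python
import random


def simulate_partial_mle(theta, k, l, num_blocks, seed=0):
    """Partial-likelihood MLE of theta for an AR(1) series observed in segments."""
    rng = random.Random(seed)
    z = rng.gauss(0.0, (1.0 - theta**2) ** -0.5)  # stationary starting value
    num = den = 0.0
    for _ in range(num_blocks):
        # Observed segment: k transitions with both z_{t-1} and z_t observed.
        for _ in range(k):
            z_new = theta * z + rng.gauss(0.0, 1.0)
            num += z * z_new
            den += z * z
            z = z_new
        # Gap of l steps; the last step lands on the first observation
        # of the next segment, and none of these pairs enter the sums.
        for _ in range(l):
            z = theta * z + rng.gauss(0.0, 1.0)
    n_total = num_blocks * (k + l)
    return num / den, den / n_total


theta_hat, v_rate = simulate_partial_mle(theta=0.5, k=8, l=2, num_blocks=5000)
print(theta_hat, v_rate)
```

With these illustrative values the second output should be close to $(k/(k+l))/(1-\theta^2) = 0.8/0.75$, in line with the limit of $N^{-1}V_N$ stated above.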

When $l/k$ is not negligible, the information lost by using only the partial
likelihood is contained in the conditional densities of $z_{n_j}$ given $z_{m_{j-1}}$, $j = 2, \ldots, J$.
(We are ignoring $z_{n_1}$, but this does not affect the asymptotics.) Write $z_{n_j}$ as

    $z_{n_j} = a^*_{n_j} + \theta^l z_{m_{j-1}}$,

where

    $a^*_{n_j} = a_{n_j} + \theta a_{n_j - 1} + \cdots + \theta^{l-1} a_{m_{j-1}+1}$  is  $N(0, \sigma^2)$

with

    $\sigma^2 = (1 - \theta^{2l})/(1 - \theta^2)$.

Let $l_j^{(c)} = \log f(z_{n_j} \mid z_{m_{j-1}})$ and $\sigma_j^{(c)} = \mathrm{Var}[((\partial/\partial\theta) l_j^{(c)})_0 \mid z_{m_{j-1}}]$. By direct
calculation, the (normalized) information contained in $L_N^{(c)}$ is

    $\frac{1}{N}\sum_j \sigma_j^{(c)} \to_p \frac{1}{k + l}\left[\frac{l^2\theta^{2(l-1)}}{1 - \theta^{2l}} + \frac{1}{2\sigma^4}\left(\frac{\partial\sigma^2}{\partial\theta}\right)^2\right]$.

Thus

(6.1)   (asymp. efficiency of $\hat\theta^{(p)}$) $= \dfrac{k/(1 - \theta^2)}{k/(1 - \theta^2) + [l^2\theta^{2(l-1)}/(1 - \theta^{2l})] + (1/2\sigma^4)(\partial\sigma^2/\partial\theta)^2}$.

Note that as $l \to \infty$, $(1/2\sigma^4)(\partial\sigma^2/\partial\theta)^2 \to \theta^2/(1 - \theta^2)^2$ and hence (asymp.
efficiency of $\hat\theta^{(p)}$) $\to k(k + (\theta^2/(1 - \theta^2)))^{-1}$. This limit is close to 0 if $\theta$ is close
to 1, a result quite contrary to intuition. This seeming contradiction is due to the
oversimplifying assumption that the variance of the random errors is known to
be 1. In fact, if $\gamma = \mathrm{Var}(a_t)$ is introduced as a parameter, then the last term in
the denominator of the efficiency expression (6.1) disappears, giving the limit 1
for the efficiency as $l \to \infty$.
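
For a feel of the magnitudes, the efficiency expression can be evaluated directly. The sketch below assumes the reading of (6.1) in which the denominator is $k/(1-\theta^2)$ plus a location term $l^2\theta^{2(l-1)}/(1-\theta^{2l})$ and a scale term $(1/2\sigma^4)(\partial\sigma^2/\partial\theta)^2$, with $\sigma^2 = 1 + \theta^2 + \cdots + \theta^{2(l-1)}$; the function names are hypothetical.

```python
def gap_information(l, theta):
    """Information about theta carried by one gap of length l (known error variance 1)."""
    # sigma^2 = (1 - theta^(2l)) / (1 - theta^2), written as a finite sum.
    sigma2 = sum(theta ** (2 * i) for i in range(l))
    # Derivative of sigma^2 with respect to theta, term by term.
    dsigma2 = sum(2 * i * theta ** (2 * i - 1) for i in range(1, l))
    location = l**2 * theta ** (2 * (l - 1)) / (1.0 - theta ** (2 * l))
    scale = dsigma2**2 / (2.0 * sigma2**2)
    return location + scale


def efficiency(k, l, theta):
    """Asymptotic efficiency of the partial-likelihood MLE under this reading of (6.1)."""
    partial = k / (1.0 - theta**2)
    return partial / (partial + gap_information(l, theta))


for l in (1, 2, 5):
    print(l, efficiency(k=5, l=l, theta=0.5))
```

For $l = 1$ the scale term vanishes and the expression reduces to $k/(k+1)$, which agrees with the lower bound $k/(k+l)$ of case (i).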

This is the only context where the assumption of constant error variance 
makes a qualitative difference; in the following discussion we will continue to 
assume constant variance for the sake of simplicity. 

(ii) Incidental parameters: Suppose that at the end of each observed segment,
the level of the series is shifted by an unknown amount $\mu_j$, causing the series to
be unobserved in the next $l$ units of time. That is, we assume
$z_{m_j+1} = \theta(\mu_j + z_{m_j}) + a_{m_j+1}$ and that the rest of the series follows an AR(1)
model. Thus, each of the conditional densities $f(z_{n_{j+1}} \mid z_{m_j})$ involves a different
incidental parameter $\mu_j$. We will assume also that the sequence $\mu_1, \mu_2, \ldots$ has
enough regularity so that

(6.2)   $P(\delta_1 < N^{-1}V_N < \delta_2) \to 1$ for some constants $\delta_1, \delta_2 > 0$.

It is clear that some regularity assumptions for the $\mu_j$'s are necessary for
discussion of asymptotics; condition (6.2) is, in fact, quite mild, being satisfied, for
example, if the $\mu_j$'s are uniformly bounded. Under this condition, $\hat\theta^{(p)}$ is
consistent for $\theta$; in contrast, the MLE $\hat\theta^{(f)}$ turns out to be an inconsistent estimator
for $\theta$, a result we now proceed to establish. Using the fact that
$z_{n_{j+1}} = a^*_{n_{j+1}} + \theta^l(\mu_j + z_{m_j})$, where $a^*_{n_{j+1}}$ is a $N(0, \sigma^2)$ random variable as defined in (i), we obtain

    $\frac{\partial}{\partial\theta} l_j^{(c)} = -\frac{1}{2\sigma^2}\frac{\partial\sigma^2}{\partial\theta} + \sigma^{-2}(z_{n_{j+1}} - \theta^l(\mu_j + z_{m_j}))(l\theta^{l-1}(\mu_j + z_{m_j}))$
                $+ \frac{1}{2\sigma^4}\frac{\partial\sigma^2}{\partial\theta}(z_{n_{j+1}} - \theta^l(\mu_j + z_{m_j}))^2$,

    $\frac{\partial}{\partial\mu_j} l_j^{(c)} = \frac{\theta^l}{\sigma^2}(z_{n_{j+1}} - \theta^l(\mu_j + z_{m_j}))$.

The value of $\mu_j$ that makes $(\partial/\partial\mu_j) l_j^{(c)}$ zero will also make the second and third
terms of $(\partial/\partial\theta) l_j^{(c)}$ vanish. Recalling the form of the likelihood equations (5.9),
we see that the MLE $\hat\theta^{(f)}$ must satisfy the equation

(6.3)   $\frac{\partial L_N^{(p)}}{\partial\theta} - \frac{J}{2\sigma^2}\frac{\partial\sigma^2}{\partial\theta} = 0$.

It follows from an easy calculation that the solution $\hat\theta^{(f)}$ of (6.3) satisfies

    $\hat\theta^{(f)} = \theta_0 - \frac{1}{2}\,\frac{1}{k + l}\,\frac{1}{N^{-1}V_N}\left(\sigma^{-2}\frac{\partial\sigma^2}{\partial\theta}\right)_{\theta = \hat\theta^{(f)}} + o_p(1)$.

Thus, if $(\sigma^{-2}(\partial\sigma^2/\partial\theta))_{\theta=\theta_0} \ne 0$ then it is impossible for $\hat\theta^{(f)}$ to converge to $\theta_0$.
Finally, it can be checked that for $l \ge 2$ the only real roots of the equation
$\sigma^{-2}(\partial\sigma^2/\partial\theta) = 0$ are 0 and 1.

(iii) Random level shifts: When there are many nuisance parameters we may
try to model them also. In the present example, we will illustrate the efficiency
calculation for the simple model in which the level shifts are random variables
with a common unknown density, independent of each other and independent of
the $a_t$'s. The appropriateness of such a model, of course, depends on the
particular application and should be, just like the original autoregressive
assumption, subjected to careful scrutiny.

The above model is equivalent to

(6.4)   $z_{n_{j+1}} = a^*_{n_{j+1}} + \theta^l z_{m_j} + U_j$,

where the $U_j$'s are i.i.d. random variables with a common unknown density $g$, with
respect to a given finite measure $\mu$, and are independent of the errors $a_t$'s. The
$N(0, \sigma^2)$ random variable $a^*$ is as defined in (i).

Suppose that the true density $g_0$ is positive a.e. ($\mu$). Then without loss of
generality we can take $\mu$ to be the measure induced by $g_0$, and take $g_0 = 1$. We
will consider $h = \sqrt{g}$ as our nuisance parameter and consider as the nuisance
parameter space $\Gamma = \{h: \int h^2\,d\mu = 1\}$. Let $h_1, h_2, \ldots$ be such that $\{1, h_1, h_2, \ldots\}$
is an orthonormal basis in $L^2(\mu)$, and consider the sequence of parameterizations:

(6.5)   $h(\cdot \mid \eta) = \left\|1 + \sum_{1}^{r}\eta_i h_i\right\|^{-1}\left(1 + \sum_{1}^{r}\eta_i h_i\right) = \left(1 + \sum_{1}^{r}\eta_i^2\right)^{-1/2}\left(1 + \sum_{1}^{r}\eta_i h_i\right)$.

Here $\|\cdot\|$ denotes the norm in $L^2(\mu)$, and we use $\langle\cdot,\cdot\rangle$ to denote the
corresponding inner product. To verify the crucial condition (5.6), it suffices to make
the following elementary observations:
(a) the tangent space (of $\Gamma$) at $h_0 = 1$ is $\mathcal{T} = \{h: \langle h, 1\rangle = 0\}$,
(b) the $i$th partial derivative of $h(\cdot \mid \eta)$ at $h_0$ is simply $h_i$, $i = 1, 2, \ldots$,
(c) $\{h_1, h_2, \ldots\}$ forms an orthonormal basis of $\mathcal{T}$.


Therefore, we can proceed to calculate the minimal Fisher information by the
method of Section 5.2. By (6.4),

    $l_j^{(c)} = \log f(z_{n_{j+1}} \mid z_{m_j}) = \log\left[\int k_\sigma(z_{n_{j+1}} - \theta^l z_{m_j} - u)\,g(u)\,d\mu(u)\right]$,

    $k_\sigma(u) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-u^2/2\sigma^2}$.

Let $y_j = z_{n_{j+1}} - \theta^l z_{m_j} = a^* + U_j$. Then by direct calculation,

(6.6)   $\left[\frac{\partial l_j^{(c)}}{\partial\theta}\right]_0 = a_1 z_{m_j} f_1(y_j) + a_2 f_2(y_j)$,
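where

    $a_1 = \frac{1}{\sigma^2}\frac{\partial\theta^l}{\partial\theta}$,    $a_2 = \frac{1}{2}\,\frac{1}{\sigma^4}\frac{\partial\sigma^2}{\partial\theta}$,

    $f_1(y) = \dfrac{\int (y - u)\,k_\sigma(y - u)\,d\mu(u)}{\int k_\sigma(y - u)\,d\mu(u)}$,    $f_2(y) = \dfrac{\int ((y - u)^2 - \sigma^2)\,k_\sigma(y - u)\,d\mu(u)}{\int k_\sigma(y - u)\,d\mu(u)}$,

and

    $\left[\frac{\partial l_j^{(c)}}{\partial\eta_i}\right]_0 = 2(Gh_i)(y_j)$,

where $G$ is the integral operator defined by

    $(Gh)(y) = \dfrac{\int k_\sigma(y - u)\,h(u)\,d\mu(u)}{\int k_\sigma(y - u)\,d\mu(u)}$.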
These and many expressions derived below can be given a probabilistic
interpretation if we introduce three abstract random variables $Z$, $U$, and $Y$, where $Z$
denotes a r.v. having a distribution equal to the marginal distribution of $z_t$,
$U$ denotes a r.v. having density $g_0(\cdot)$, and $Y$ denotes a r.v. such that the
distribution of $Y$ given $U = u$ is $N(u, \sigma^2)$. It is also assumed that $Z$ is independent
of $(U, Y)$. The operator $G$ can now be interpreted as a conditional expectation
$(Gh)(y) = E(h(U) \mid Y = y)$. Furthermore, it can be easily checked that

    $J^{-1}\sum_{j=1}^{J}\mathrm{Var}\left[\left(\frac{\partial l_j^{(c)}}{\partial\theta}\right)_0 \Big|\, z_{m_j}\right]$
        $\to_p a_1^2 E(Z^2)\mathrm{Var}(f_1(Y)) + 2a_1a_2E(Z)\mathrm{Cov}(f_1(Y), f_2(Y)) + a_2^2\mathrm{Var}(f_2(Y))$,

    $J^{-1}\sum_{j=1}^{J}\mathrm{Cov}\left[\left(\frac{\partial l_j^{(c)}}{\partial\theta}\right)_0, \left(\frac{\partial l_j^{(c)}}{\partial\eta_i}\right)_0 \Big|\, z_{m_j}\right]$
        $\to_p \mathrm{Cov}[a_1E(Z)f_1(Y) + a_2f_2(Y), (Gh_i)(Y)]$.

The first limit is the variance of

(6.7)   $[a_1(\mathrm{Var}(f_1(Y)))^{1/2}Z] + f_3(Y)$, where $f_3(Y) = a_1E(Z)f_1(Y) + a_2f_2(Y)$.

Thus the marginal information $H_{\theta\cdot\eta}$ for $\theta$ contained in $L_N^{(c)}$ is obtained by
setting $(k + l)H_{\theta\cdot\eta}$ = the residual variance in the regression of the random
variable (6.7) on the variables $(Gh_i)(Y)$, $i = 1, \ldots, r$.

Since $Z$ is independent of $Y$ it is clear that $H_{\theta\cdot\eta}$ cannot be made to vanish by
increasing $r$. We now show that the variance of the second term $f_3(Y)$ in (6.7) can
be explained arbitrarily well by increasing the number of explanatory variables
in the regression. Consider the Hilbert spaces $H(Y) = \{f(Y): Ef^2(Y) < \infty\}$
and $H(U) = \{h(U): Eh^2(U) < \infty\}$. It is clear that $f_3(Y) \in H(Y)$. To prove
the result, it suffices to show that $f_3(Y)$ is in the closure of the subspace
of $H(Y)$ spanned by $\{1, Gh_1, Gh_2, \ldots\}$. Since $\{1, h_1, h_2, \ldots\}$ is a basis of
$H(U)$ and $G1 = 1$, it is enough to show that $G: H(U) \to H(Y)$ is an
isomorphism. The continuity of $G$ follows from the inequality
$E\{[(Gh)(Y)]^2\} = E\{[E(h(U) \mid Y)]^2\} \le Eh^2(U) < \infty$. That this map is 1-1 can be seen by checking
that

    $(Gh)(y) = 0 \;\forall y \;\Rightarrow\; \int k_\sigma(y - u)g_0(u)h(u)\,d\mu(u) = 0 \;\forall y \;\Rightarrow\; h(U) = 0$ a.e.

That this map is also onto can be seen by the following argument: if $f(Y) \perp
Gh(Y)$ for all $h \in H(U)$, then we have $E(h(U)f(Y)) = 0$ for all $h$; hence
$E(f(Y) \mid U = u) = 0$ for all $u \in$ support($g_0$); it then follows from the completeness
of the normal family that $f(Y) = 0$.

Thus the minimal Fisher information contained in the supplementary partial
likelihood $L_N^{(c)}$ is $a_1^2\mathrm{Var}(f_1(Y))\mathrm{Var}(Z)$. Can this information for $\theta$ really be
utilized? The following argument convinces us that there must be some way to
make use of at least part of this information. Consider the simplest case $l = 1$;
then $z_{n_{j+1}} = \theta z_{m_j} + e_j$, where $e_j = a_{n_{j+1}} + U_j$. Suppose $z_{m_j}$ can take only two
values, say $\delta$ and $-\delta$. Then conditional on the $z_{m_j}$'s, the observations $z_{n_{j+1}}$'s have a
two-sample location shift structure. Though the density for $e_j$ is unknown, it is
possible to estimate the shift $2\theta\delta$. The larger the value of $\delta$ [and thus $\mathrm{Var}(Z)$],
the easier it is to estimate $\theta$. In our model the $z_{m_j}$'s can of course take values
other than $\pm\delta$, but even if we throw away all the information in $f(z_{n_{j+1}} \mid z_{m_j})$
except for those $z_{m_j}$'s with values close to $\pm\delta$, the above argument implies that
we still can use the remaining ones to estimate $\theta$.
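
The two-sample argument can be tried out in a toy simulation. Everything specific below is an assumption made for illustration: the predecessor values are drawn as $\pm\delta$ with equal probability, the shift $U_j$ is taken to be exponential (a skewed distribution that the estimator never uses), the shift is estimated by a difference of sample means, and the function name `shift_estimate` is hypothetical.

```python
import random
import statistics


def shift_estimate(theta, delta, n, seed=0):
    """Estimate theta from the location shift between the +delta and -delta groups."""
    rng = random.Random(seed)
    plus, minus = [], []
    for _ in range(n):
        z_prev = delta if rng.random() < 0.5 else -delta
        # e = a + U: standard normal error plus a shift whose distribution
        # is treated as unknown by the estimator (assumed exponential here).
        e = rng.gauss(0.0, 1.0) + rng.expovariate(1.0)
        z_next = theta * z_prev + e
        (plus if z_prev > 0 else minus).append(z_next)
    # The two groups differ only by a location shift of 2 * theta * delta.
    shift = statistics.mean(plus) - statistics.mean(minus)
    return shift / (2.0 * delta)


theta_est = shift_estimate(theta=0.6, delta=2.0, n=20000)
print(theta_est)
```

Because the standard error of this estimate scales like $1/\delta$, a larger spread in the predecessor values makes $\theta$ easier to estimate, as the argument above suggests.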

How can one make use of this information? A natural approach might be some
kind of adaptive estimation involving the estimation of the mixture density $g$.
However, such a method would be very complicated and its properties are largely
unknown. The loss of information, $a_1^2\mathrm{Var}(f_1(Y))\mathrm{Var}(Z)$, serves as a guide in
choosing between the partial likelihood or more complicated methods. Note that
$a_1$ decreases exponentially as the length $l$ of the missing sequences increases.
Thus if $\theta$ is not too close to 1 and $l$ is large, we can be sure that the partial
likelihood is nearly fully informative.


REMARK. There is a special structure in this example which, at least heuristically,
allows an easier computation of the minimal Fisher information. For fixed $\theta$
and $z_{m_j}$, $y_j = z_{n_{j+1}} - \theta^l z_{m_j}$ is a complete sufficient statistic for the unknown
shift $u_j$. In the above we have, by exploiting this completeness and sufficiency,
essentially shown that the affine subspace generated by the scores of the
nuisance parameters, $(\partial l_j^{(c)}/\partial\eta_i)$, $i = 1, 2, \ldots$, is the same as the subspace
generated by $L^2$ functions of $y_j$. Since projection to this subspace is the same as
conditional expectation, the minimal Fisher information can be calculated more
easily in the following way:

(a) calculate the “conditional $\theta$-score” $s_j - E(s_j \mid y_j)$ where $s_j$ denotes the
$\theta$-score given in (6.6), i.e.,

    $s_j - E(s_j \mid y_j) = a_1(z_{m_j} - E(z_{m_j}))f_1(y_j)$;

(b) calculate the “conditional score information,”

    $i_\theta = \mathrm{Var}(s_j - E(s_j \mid y_j)) = a_1^2\mathrm{Var}(Z)\mathrm{Var}(f_1(Y))$.

This gives the same value as the minimal Fisher information derived above. The
approach of studying information by conditioning on a complete sufficient
statistic (which may depend on $\theta$) of the incidental parameter is studied in an
insightful paper of Lindsay (1983). The definition of the conditional score
information ($i_\theta$) as given in that paper is not entirely correct, being defined in
terms of the density conditional on the incidental parameter (rather than the
density with the incidental parameter integrated out); in the context of
the current example, this leads to $i_\theta = a_1^2\mathrm{Var}(Z)E(Y - U)^2$, a value always too large
as compared to that given above in (b), namely, $a_1^2\mathrm{Var}(Z)E(Y - E(U \mid Y))^2$.
Nevertheless, Lindsay's main theorem, which asserts that in the i.i.d. exponential
family mixture setting $i_\theta$ is the same as the minimal Fisher information, seems to
be correct after appropriate modification of the definition of $i_\theta$. The above
example indicates that Lindsay's theorem might be expected to hold in more
general settings.


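6.2. Proportional hazard model. As another illustration of the general theory,
we treat the proportional hazard model under the assumption of no censoring.
In this case the information missed by the partial likelihood $L_N^{(p)}$ is
contained in the conditional distribution of $t_n$ given $d_n = (t_{n-1}, R_n)$, where
$R_n = R_{t_{n-1}}$ = risk set after $t_{n-1}$. Now, given $d_n$, $t_n$ is a random variable with
hazard function $\lambda_0(t)b_n(\theta)$ for $t > t_{n-1}$, where $b_n(\theta) = \sum_{i \in R_n} w(\theta, z_i) =
(N - n + 1)\sum_{z \in Z} p_n(z)w(\theta, z)$. Thus,

    $l_n^{(c)} = \log f(t_n \mid d_n) = \log\lambda_0(t_n) + \log b_n(\theta) - \int_{t_{n-1}}^{t_n}\lambda_0(t)b_n(\theta)\,dt$,

and

(6.8)   $\left[\frac{\partial l_n^{(c)}}{\partial\theta_i}\right]_0 = a_{ni} - a_{ni}T_n^*$,  where  $T_n^* = \int_{t_{n-1}}^{t_n}\lambda_{00}(t)b_n(\theta_0)\,dt$,

        $a_{ni} = \left[\frac{\partial}{\partial\theta_i}\log b_n(\theta)\right]_0 = \dfrac{\sum_{z \in Z} p_n(z)\,\frac{\partial}{\partial\theta_i}w(\theta_0, z)}{\sum_{z \in Z} p_n(z)\,w(\theta_0, z)}$.

Here $\lambda_{00}(\cdot)$ denotes the true value of the base line hazard $\lambda_0(\cdot)$.
Embedding $\lambda_0(t)$ in the parametric family $\lambda_0(t) = \lambda_{00}(t)\exp\{\sum_{i=1}^{r}\eta_i g_i(t)\}$, we
have

    $\frac{\partial}{\partial\eta_i}\lambda_0(t) = \lambda_{00}(t)\,e^{\sum\eta_i g_i(t)} \cdot g_i(t)$,

(6.9)   $\left[\frac{\partial l_n^{(c)}}{\partial\eta_i}\right]_0 = g_i(t_n) - \int_{t_{n-1}}^{t_n}\lambda_{00}(t)b_n(\theta_0)g_i(t)\,dt$
                $= -g_i(t_{n-1})(T_n^* - 1) + O_p(t_n - t_{n-1})$.

From the distribution of $t_n$ given $d_n$, the distribution of $T_n^*$ conditional on $d_n$ is
seen to be an exponential distribution with expectation 1; using this fact, we have

    $\mathrm{Cov}\left[\left(\frac{\partial l_n^{(c)}}{\partial\theta_i}\right)_0, \left(\frac{\partial l_n^{(c)}}{\partial\theta_j}\right)_0 \Big|\, d_n\right] = a_{ni}a_{nj}$,

    $\mathrm{Cov}\left[\left(\frac{\partial l_n^{(c)}}{\partial\theta_i}\right)_0, \left(\frac{\partial l_n^{(c)}}{\partial\eta_j}\right)_0 \Big|\, d_n\right] = a_{ni}g_j(t_{n-1})$,

    $\mathrm{Cov}\left[\left(\frac{\partial l_n^{(c)}}{\partial\eta_i}\right)_0, \left(\frac{\partial l_n^{(c)}}{\partial\eta_j}\right)_0 \Big|\, d_n\right] = g_i(t_{n-1})g_j(t_{n-1})$.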

We now discuss the asymptotic behavior of these conditional covariances as
$N \to \infty$, $n \to \infty$, $n/N \to x \in (0,1)$. Rigorous proof for every step will not be
provided, but sufficient details are given so that the limiting values can be
obtained numerically in each model specification. The limiting value of $a_{ni}$
depends on that of $p_n(z)$. Hence,

(6.10)   $a_{ni} \to \bar a_i(x) = \dfrac{\sum_{z \in Z} h_z(x)\,\frac{\partial}{\partial\theta_i}w(\theta_0, z)}{\sum_{z \in Z} h_z(x)\,w(\theta_0, z)}$  as $n/N \to x$,

where $h_z$ is obtained by solving an O.D.E. as given in Appendix A.2. To
investigate the asymptotic behavior of $t_n$, note that from $E(T^*_{n+1} \mid d_{n+1}) = 1$, we
have

    $E[\lambda_{00}(t_n)(t_{n+1} - t_n) \mid d_{n+1}] \approx \dfrac{1}{N(1 - x)}\,\dfrac{1}{\sum_{z \in Z} h_z(x)\,w(\theta_0, z)}$.

Suppose $\bar t(x)$ is the limiting value of $t_n$ as $n/N \to x$,
$\Lambda(t) = \int_0^t \lambda_{00}(s)\,ds$ and $y(x) = \Lambda(\bar t(x))$;
then the above expression suggests that $y$ satisfies the O.D.E.

(6.11)   $y'(x) = \dfrac{1}{(1 - x)\sum_{z \in Z} h_z(x)\,w(\theta_0, z)}$,  with $y(0) = 0$.

Thus $\bar t(x) = \Lambda^{-1}(y(x))$ can always be solved numerically for any specified $w$ and
$\lambda_{00}$; under obvious conditions, $\bar t(x)$ is a strictly increasing function of $x$, $\bar t(x) \to \infty$
if $x \to 1$. If we define $\bar g_i(x) = g_i(\bar t(x))$, then $g_i(t_{n-1}) \to \bar g_i(x)$ as $n/N \to x$.
Finally, returning to the conditional covariance, we see that

    $\frac{1}{N}\sum_{n=1}^{N}\mathrm{Cov}\left[\left(\frac{\partial l_n^{(c)}}{\partial\theta_i}\right)_0, \left(\frac{\partial l_n^{(c)}}{\partial\theta_j}\right)_0 \Big|\, d_n\right] \to \int_0^1 \bar a_i(x)\bar a_j(x)\,dx \overset{\mathrm{def}}{=} H_{\theta_i\theta_j}$,

    $\frac{1}{N}\sum_{n=1}^{N}\mathrm{Cov}\left[\left(\frac{\partial l_n^{(c)}}{\partial\theta_i}\right)_0, \left(\frac{\partial l_n^{(c)}}{\partial\eta_j}\right)_0 \Big|\, d_n\right] \to \int_0^1 \bar a_i(x)\bar g_j(x)\,dx \overset{\mathrm{def}}{=} H_{\theta_i\eta_j}$,

    $\frac{1}{N}\sum_{n=1}^{N}\mathrm{Cov}\left[\left(\frac{\partial l_n^{(c)}}{\partial\eta_i}\right)_0, \left(\frac{\partial l_n^{(c)}}{\partial\eta_j}\right)_0 \Big|\, d_n\right] \to \int_0^1 \bar g_i(x)\bar g_j(x)\,dx \overset{\mathrm{def}}{=} H_{\eta_i\eta_j}$.

It is easy to see that under suitable conditions the functions $\bar a_i(\cdot)$, $i = 1, \ldots, p$,
are square integrable functions, i.e., the regression of $(\partial l^{(c)}/\partial\theta)_0$ on $(\partial l^{(c)}/\partial\eta)_0$
is equivalent to a projection in the Hilbert space $L^2[0,1]$ of $\bar a_1, \ldots, \bar a_p$ on the
subspace spanned by $\bar g_1, \ldots, \bar g_r$. Since the $g_i$'s are arbitrary, the $\bar g_i$'s are also
arbitrary, and hence the residual variance can be made arbitrarily small by
choosing the $g_i$'s appropriately. In the natural Cox model, Efron (1977) and Oakes
(1977) have given formulae for the efficiency of the partial likelihood. But the
calculations from these formulae can become very complicated in special cases. In
the theory presented here, the parameterization of the base line hazard is general
enough to include the special cases studied by the above authors. Explicit steps
are provided for the numerical computation of the matrix of loss of information.
The computation is straightforward if an O.D.E. solver is available. The above
discussion also appears to be the first systematic account of efficiency calculation
for the case of the general form of the relative risk. However, further effort is
needed to remove the present restriction of no censoring.


APPENDIX 


A.1. Ergodicity of some generalized AR processes. For the Bernoulli
case, the transition matrix for $\{Y_n\}$ is

             (0,0)                      (0,1)                (1,0)                               (1,1)
    (0,0)    $1 - \theta_0$             $\theta_0$           0                                   0
    (0,1)    0                          0                    $1 - (\theta_0 + \theta_1)$         $(\theta_0 + \theta_1)$
    (1,0)    $1 - (\theta_0 + \theta_2)$   $(\theta_0 + \theta_2)$   0                            0
    (1,1)    0                          0                    $1 - (\theta_0 + \theta_1 + \theta_2)$   $(\theta_0 + \theta_1 + \theta_2)$

The region of ergodicity is clearly determined by requiring all eight nonzero
entries in the matrix to be strictly positive. This defines a nonempty polygon in
$R^3$.
For the Poisson case, let $\mu_n = (1 - e^{-(\theta_0 + \theta_1 x_{n-1} + \theta_2 x_{n-2})}) = E(x_n \mid x_{n-1}, \ldots)$, and
$q(x, y \mid n, m) = P(x_t = x, x_{t-1} = y \mid x_{t-1} = n, x_{t-2} = m)$;
then

    $q(x, y \mid n, m) = \delta_{yn}\exp[-(1 - e^{-(\theta_0 + \theta_1 n + \theta_2 m)})] \times (1 - \exp[-(\theta_0 + \theta_1 n + \theta_2 m)])^x / x!$,

where $\delta_{yn}$ is the Kronecker delta symbol. If $\theta_0 > 0$, it is clear that all states are
reachable from each other. Hence the chain is irreducible. It is clearly aperiodic.
To see that it is ergodic, let $q^2(x, y \mid n, m)$ be the two-step transition matrix.
Then $q^2((0,0) \mid (n, m)) > \varepsilon > 0$ for some $\varepsilon > 0$. Hence

    $P(x_{2r} = 0, x_{2r-1} = 0 \mid x_{2(r-1)} = n, x_{2(r-1)-1} = m) > \varepsilon$,

so,

    $P_{00}$(first return to $(0,0) > 2r + 1$)
        $\le P_{00}((x_2, x_1) \ne (0,0), (x_4, x_3) \ne (0,0), \ldots, (x_{2r}, x_{2r-1}) \ne (0,0))$
        $\le (1 - \varepsilon)^r$.

It follows that the mean recurrence time to $(0,0)$ is finite, and the chain is ergodic.


A.2. Stochastic development of proportional hazard systems. The
model is described in Section 3.3. First consider the case when there is no
censoring; then conditional on the death times $t_{(1)}, \ldots, t_{(N)}$ the stochastic
evolution of the system is equivalent to the following: a population of $N$ individuals is
sequentially sampled; at each draw the probability for any individual to be
selected is proportional to the weight $w(\theta_0, z)$, where $z$ is the explanatory
variable value associated with that individual; the selected individual is then
removed before the next draw. Throughout we suppose that the set of possible
values of the explanatory variable is finite, i.e., $Z = \{z^{(1)}, \ldots, z^{(K)}\}$. An individual
will be called a type $k$ individual if the associated explanatory variable value is
$z^{(k)}$. Let $w_k = w(\theta_0, z^{(k)})$ and

(A.1)   $x_n = k$ if the individual selected at the $n$th draw is of type $k$,

(A.2)   $M_n^{(k)}$ = the number of type $k$ individuals just before the $n$th draw,

(A.3)   $h_n^{(k)} = M_n^{(k)}/(N - n + 1)$ = proportion of type $k$ before the $n$th draw,

(A.4)   $g_n^{(k)} = w_k h_n^{(k)} \Big/ \left(\sum_{j=1}^{K} w_j h_n^{(j)}\right) = P(x_n = k \mid x_{n-1}, x_{n-2}, \ldots)$.

The question is: With the initial proportions $q_1, \ldots, q_K$ of the $K$ types of
individuals fixed and the population size $N \to \infty$, is there any nontrivial limiting
behavior in the evolution of the system? If so, how can the limits be calculated?

A simple simulation showed that the quantities that become stable are the
values of $h_n^{(k)}$ and $g_n^{(k)}$ when $n$ increases with $N$ in such a way that $n/N \to t$.
If we define the random vector functions $h^{(N)}(\cdot)$ and $g^{(N)}(\cdot)$ by

(A.5)   $h_k^{(N)}(t) = h_n^{(k)}$ if $t = n/N$, and interpolated linearly otherwise, for $k = 1, \ldots, K$,

(A.6)   $g_k^{(N)}(t) = w_k h_k^{(N)}(t) \Big/ \sum_{j=1}^{K} w_j h_j^{(N)}(t)$,

the simulation result suggested that there is a function $h(\cdot)$ such that, as
$N \to \infty$, $h^{(N)}(t) \to_p h(t)$ for all $0 \le t < 1$.

How does the limiting function $h(\cdot)$ depend on the sampling weights $w_k$'s and
the initial proportions $q_k$'s, and how can one actually calculate it? We will only
briefly indicate the resolution to these questions here: from (A.1)-(A.4) it is clear
that

(A.7)   $h_k^{(N)}\left(\frac{n+1}{N}\right) = \frac{1}{N - n}\left[M_n^{(k)} - 1\{x_n = k\}\right]
         = h_k^{(N)}\left(\frac{n}{N}\right)\cdot\frac{N - n + 1}{N - n} - \frac{1}{N - n}\,1\{x_n = k\}$,

(A.8)   $E\left[h^{(N)}\left(\frac{n+1}{N}\right) \Big|\, x_{n-1}, x_{n-2}, \ldots\right] - h^{(N)}\left(\frac{n}{N}\right)
         = \frac{1}{N}\cdot\frac{1}{1 - n/N}\left(h^{(N)}(n/N) - g^{(N)}(n/N)\right)$.

Writing $t = n/N$, $\Delta t = 1/N$, and passing to the limit as $N \to \infty$ in (A.8), we
obtain

(A.9)   $h(t + \Delta t) - h(t) = \Delta t \cdot (h(t) - g(t))/(1 - t)$.

In other words, to determine $h(t)$, it is only necessary to solve the system of
ordinary differential equations:

(A.10)   $\frac{d}{dt}h_k(t) = \dfrac{h_k(t) - g_k(t)}{1 - t}$,

with

    $g_k(t) = w_k h_k(t) \Big/ \sum_{j=1}^{K} w_j h_j(t)$

and

    $h_k(0) = q_k$  (initial values).

For example, suppose the initial population size is $N = 5120$, divided equally
into two types, and the sampling weights are $w_1 : w_2 = 4 : 1$. A computer Monte
Carlo experiment of 200 replications is performed. In each replication, we record
the proportion $h^{(N)}(0.4)$ of the type 1 individuals among those still surviving just
before the 2048th death ($5120 \times 0.4 = 2048$). The mean of these 200 proportions
is 0.33592 and the SD is 0.0055. The normal score plot is given in Figure A.1
below; there is clearly no evidence of nonnormality. By solving the O.D.E. (A.10)
for $h$, we obtain $h(0.4) = 0.33597$. Using the $h$ solved from (A.10), we solve (A.11)
to get $\phi = \mathrm{Var}[\sqrt{N}(h^{(N)} - h)]$; at $t = 0.4$, we obtain $\phi = 0.1491$, giving SD =
0.0054. In fact, much more is true: for any $A < 1$, $\sup_{0 \le s \le A}|h^{(N)}(s) - h(s)| \to_p 0$;
furthermore, for any $t < 1$, $\sqrt{N}(h^{(N)}(t) - h(t))$ is asymptotically normal with
well-determined variances and covariances. For example, in the simplest case,
$K = 2$, if $\phi(t) = \mathrm{Var}[\sqrt{N}(h_1^{(N)}(t) - h_1(t))]$ and $S(t) = \sum_{1}^{2} w_j h_j(t)$, then

d_ 26 a 7 | z 
(A.11) di i pal w) + E(w, Wy) + Gx yaC Z)

with $\phi(0) = 0$. The rigorous proofs are not of central interest in this paper and are
omitted. These results are in excellent agreement with those from simulation.

FIG. A.1. Normal score plots of $h(0.4)$ for 200 replications.
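
The numerical example can be reproduced approximately with a few lines of code. The sketch below assumes the form of (A.10) in which $dh_k/dt = (h_k - g_k)/(1 - t)$ with $g_k = w_k h_k / \sum_j w_j h_j$; the classical fourth-order Runge-Kutta scheme, the step count, and the function names are choices made here for illustration, not the method used for the reported figures. A single Monte Carlo replication of the weighted sampling scheme is included for comparison.

```python
import random


def rhs(t, h, w):
    """Right-hand side of the O.D.E. system: (h_k - g_k) / (1 - t)."""
    total = sum(wj * hj for wj, hj in zip(w, h))
    return [(hk - wk * hk / total) / (1.0 - t) for wk, hk in zip(w, h)]


def solve_proportions(q, w, t_end, steps=400):
    """Integrate the system from t = 0 to t_end < 1 with classical RK4."""
    h, t, dt = list(q), 0.0, t_end / steps
    for _ in range(steps):
        k1 = rhs(t, h, w)
        k2 = rhs(t + dt / 2, [x + dt / 2 * k for x, k in zip(h, k1)], w)
        k3 = rhs(t + dt / 2, [x + dt / 2 * k for x, k in zip(h, k2)], w)
        k4 = rhs(t + dt, [x + dt * k for x, k in zip(h, k3)], w)
        h = [x + dt / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(h, k1, k2, k3, k4)]
        t += dt
    return h


def simulate_proportion(counts, w, draws, seed=0):
    """One replication: remove `draws` individuals, each chosen with
    probability proportional to its type weight; return the type-1 share left."""
    rng = random.Random(seed)
    m = list(counts)
    for _ in range(draws):
        weights = [wk * mk for wk, mk in zip(w, m)]
        u = rng.random() * sum(weights)
        k = 0
        while u >= weights[k] and k < len(m) - 1:
            u -= weights[k]
            k += 1
        m[k] -= 1
    return m[0] / sum(m)


h_ode = solve_proportions([0.5, 0.5], [4.0, 1.0], 0.4)
# State just before the 2048th death, i.e., after 2047 removals.
h_sim = simulate_proportion([2560, 2560], [4.0, 1.0], draws=2047)
print(h_ode[0], h_sim)
```

Under these assumptions the O.D.E. value for type 1 at $t = 0.4$ comes out close to the 0.33597 quoted above, and a single replication should fall within a few multiples of the reported SD of that value.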

To extend these results to the case when censoring is possible, let $N_D$ = total
number of deaths, $N_C$ = total number of censors; then $N = N_D + N_C$ = population
size. Let $M_n^{(k)}$, $h_n^{(k)}$, $g_n^{(k)}$, $h^{(N)}(\cdot)$, $g^{(N)}(\cdot)$ be as before; in addition, let $l_n^{(k)}$ =
number of censors during the interval $(t_{(n)}, t_{(n+1)})$, i.e., between the $n$th and the
$(n + 1)$th deaths, and $L_n^{(k)} = \sum_{i \le n} l_i^{(k)}$. Assume that, as $N \to \infty$, $N_D/N$ =
proportion of deaths $\to 1 - \alpha$, and as $n/N_D \to x$, the proportion of various types
of censors stabilizes, i.e.,

(A.12)   $L_n^{(k)}/N_C \to$ some limit $\gamma_k(x)$, as $n/N_D \to x$.

Then by the same reasoning that led to (A.10), we can derive the ordinary
differential equation for $h$:

(A.13)   $h_k' = \dfrac{(1 - \alpha)(h - g)_k + \alpha(h_k\gamma' - \gamma_k')}{1 - (1 - \alpha)x - \alpha\gamma}$,

where $\gamma = \sum_{1}^{K}\gamma_k$.

(WITH LEAST-SQUARES ESTIMATORS)

For an arbitrary one parameter exponential family density it is shown
how to construct a mixing distribution (prior) on the parameter in such a way
that the resulting mixture distribution is a two (or more) parameter exponential
family. Reweighted infinitely divisible distributions are shown to be the
parametric mixing distributions for which this occurs. As an illustration
conditions are given under which a parametric mixture of negative exponentials
is in the exponential family. Properties of the posterior are given,
including linearity of the posterior mean in the natural parameter. For the
discrete case a class of simply-computed yet fully-efficient least-squares
estimators is given. A Poisson example is used to demonstrate the strengths and
weaknesses of the approach.


1. Introduction. The problem is this: suppose one is given an exponential
family of densities for a random variable X which have the form

(1.1)    $f(x; \tau) = e^{\tau x - k(\tau)}, \quad \tau \in \Omega,$

with respect to an arbitrary sigma-finite measure $\mu$ on the real line. How can one
then create a parametric family of mixing distributions (or equivalently, priors)
on $\tau$ so that the resulting mixture density (the marginal for X) is still of
exponential form? That is, we seek a parametric family $Q(\cdot; \alpha, \beta)$ of
distributions on the natural parameter space $\Omega$ such that for some functions
$t(x)$ and $k(\alpha, \beta)$ one has

(1.2)    $f(x; \alpha, \beta) = \int f(x; \tau)\, dQ(\tau; \alpha, \beta) = e^{\alpha x + \beta t(x) - k(\alpha, \beta)}.$

The solution to this problem is given in Theorem 2.1; the eligible functions
$t(\cdot)$ are simply the cumulant generating functions of infinitely divisible
distributions. Section 2 of this paper provides a description of this class.

The posterior distributions corresponding to the priors of this paper will be
shown to be the exponential family tilts of the corresponding infinitely divisible
distribution, where the exponential family tilt of the distribution F(x) is the
parametric family of distributions defined by

    $dF_\theta(x) = \dfrac{e^{\theta x}\, dF(x)}{\int e^{\theta x}\, dF(x)}.$

Although the main thrust of this paper is to develop some understanding of
the structure given to the mixture problem by (1.2), attention will also be given
to modelling and estimation in Sections 3 and 4. In particular, weighted
least-squares estimators of the parameters $(\alpha, \beta)$ in (1.2) will be given,
together with a proof of their full efficiency in finite discrete models. Falling
outside the domain of this paper are densities $f(x; \theta, \tau)$ involving other
"unmixed" parameters $\theta$. Lindsay (1985) demonstrates that the method of
construction of this paper can be useful in such models.

Before proceeding to the solution, let us consider its ramifications, for which
some background and context are essential. Mixture densities of the form
$\int f(x; \tau)\, dQ(\tau)$ arise in a number of important settings. For example,
suppose that for i = 1 to n each random variable $X_i$ is an independent observation
from a stratum i which has an associated parameter $\tau_i$, but since the strata are
sampled from a population of strata, the $\tau_i$ are themselves random variables from
a distribution Q. In the usual normal linear models theory, this generates the
one-way random-effects model, with Q generally being restricted to being normal
itself. Another way the mixture model arises is as a natural model for
overdispersion ("heavy tails") relative to the basic density $f(x; \tau)$. This is
perhaps most dramatically evidenced by the convex shape, as a function of x, of the
logarithm of the ratio of the mixed density $\int f\, dQ$ to the basic density f, a
property which will be utilized later in the paper. See Shaked (1980) and Schweder
(1982) for further results regarding the relative dispersion of mixtures.

One motivation for considering new methods for generating mixing/prior
distributions is the often awkward nature of the marginal distributions for X
when the standard conjugate family is used. Although conjugate families have
many attractive features in the Bayesian mode of inference [cf. Diaconis and
Ylvisaker (1979)], even there it seems extremely limiting to have just one practical
family of priors. Dalal and Hall (1983) consider discrete mixtures of conjugates as
a method of increasing flexibility. Although the methods introduced here might
also be useful in this regard, their main virtues are necessarily frequentist. From
the Bayesian perspective perhaps the most interesting development is that this
theory yields a class of priors for which the posterior mean of the natural
parameter $\tau$ is linear in the prior's parameters [see (1.4)]. Another useful feature
is that the posterior distributions are generally quite simple.

Consider the possible advantages of an exponential family mixture model in a
frequentist setting. Suppose that one has a distribution Q so that (1.2) holds.
Observe that $\beta = 0$ corresponds to the original unmixed model, with $\alpha = \tau$. It
will be seen in the construction of Q that $\beta > 0$ will correspond to the presence of
mixing, while $\beta < 0$ represents "underdispersion" relative to the basic model.
Thus there will exist a uniformly most powerful unbiased test for the presence of
the random effect ($\beta > 0$) against the null hypothesis of no effect ($\beta = 0$) based
on the conditional distribution of $t(X)$ given $X = x$, or, with a random sample,
$\sum t(X_i)$ given $\sum X_i = x$.

A second advantage to this family of mixing densities is that when there is a
choice of the function $t(x)$ via selection of Q, there is flexibility in the choice of
the mixed model. (It will be shown that the range of choices for $t(\cdot)$ is limited
by an integrability constraint.) Furthermore, since the mixing effect shows up in
the value of an observable, $t(x)$, simple graphical techniques can be useful both in
the choice of $t(x)$ and in considering the overall validity of the mixture model.
To confirm this last point, note that

    $\log[f(x; \alpha, \beta)/f(x; \tau)] = a + (\alpha - \tau)x + \beta t(x)$

for $a = -k(\alpha, \beta) + k(\tau)$. Suppose the x-sample space is discrete. Denote by
n(x) the number of observations of x in a random sample of size n and by $\hat\tau$ the
maximum likelihood estimator of $\tau$ in the basic density (1.1). Let $\hat p(x) = n(x)/n$.
Then

(1.3)    $r(x) := \log[\hat p(x)/f(x; \hat\tau)] \to a + bx + c\,t(x)$ a.s. as $n \to \infty$,

for parametrically determined values a, b, c. For a sufficiently large sample size a
graph of (x, r(x)) is thus diagnostic for the form of $t(\cdot)$. (A Poisson example will
be given in Section 3.) In (1.3) we have used the maximum likelihood estimator of
$\tau$ to standardize r(x), as the graph then indicates departures in fit from the
no-mixing model, with r(x) near zero indicating a good fit at value x. In this
regard note that $2n \sum_x r(x)\, n(x)/n$ is the likelihood ratio goodness-of-fit statistic
for testing the adequacy of the basic model containing no mixing against an
arbitrary multinomial alternative. More generally, with adequate data in a
continuous model one possibility is to group the sample space into intervals, as in
the chi-square goodness-of-fit test, and use a suitably redefined version of
(x, r(x)) as a diagnostic for the form of $t(x)$ and the validity of the mixture model.

Along this line, note that (1.3) suggests estimation of parameters by least 
squares. In Section 4 it is shown that fully efficient estimators can be so derived 
in discrete models. 

A third useful feature of this class is the linearity of the posterior mean of the
natural parameter. Lindsay (1985) has given a class of mixture problems where
the linearity of the posterior mean yields a direct solution to an optimality
problem. The linearity of the posterior mean and the form of the posterior
variance follow from these easily derived relationships valid for exponential
family mixtures:

    $D_x \log \int f(x; \tau)\, dQ(\tau) = E[\tau \mid x],$
    $D_x^2 \log \int f(x; \tau)\, dQ(\tau) = \mathrm{Var}[\tau \mid x].$

These show that for density (1.2)

(1.4)    $E[\tau \mid x] = \alpha + \beta t'(x),$
         $\mathrm{Var}[\tau \mid x] = \beta t''(x).$

In contrast with (1.4), under the conjugate prior it is the posterior mean of the
mean value parameter $k'(\tau) = E_\tau[X]$ which is linear in the data x [see Diaconis
and Ylvisaker (1979)].
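
The posterior mean and variance relationships can be checked numerically in a simple case. The sketch below is only an illustration under stated assumptions: the basic density is taken to be $f(x; \tau) = \exp(\tau x - \tau^2/2)$ with respect to the standard normal measure, the infinitely divisible family is N(0, β) with β < 1 so that $t(x) = x^2/2$, and α = 0. The function name `posterior_moments` and the quadrature grid are hypothetical choices, not part of the paper.

```python
import numpy as np

def posterior_moments(x, beta, grid=None):
    """Posterior mean and variance of tau given x, by quadrature on a grid.

    Assumed setup (illustrative only): f(x; tau) = exp(tau*x - tau**2/2),
    k(tau) = tau**2/2, and a prior proportional to exp(k(tau)) dP_beta(tau)
    with P_beta = N(0, beta). The prior is integrable only for beta < 1.
    """
    if grid is None:
        grid = np.linspace(-40.0, 40.0, 400001)
    # Unnormalized log prior: k(tau) plus the N(0, beta) log density kernel.
    log_prior = grid**2 / 2 - grid**2 / (2 * beta)
    # Log likelihood of a single observation x.
    log_lik = grid * x - grid**2 / 2
    log_w = log_prior + log_lik
    w = np.exp(log_w - log_w.max())   # stabilize before normalizing
    w /= w.sum()
    mean = float(np.sum(w * grid))
    var = float(np.sum(w * (grid - mean) ** 2))
    return mean, var

beta, x = 0.4, 1.7
mean, var = posterior_moments(x, beta)
# With t(x) = x**2/2 the relationships give mean = beta * t'(x) = beta * x
# and variance = beta * t''(x) = beta.
print(mean, beta * x)
print(var, beta)
```

In this normal case the posterior variance does not depend on x, which is the constant-variance behaviour expected when $t''$ is constant.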

Despite their virtues the methods of mixture modelling discussed herein are
not statistical panaceas. For a start, numerical integration or summation will
typically be necessary for maximum likelihood estimation. A second difficulty is
that in a more complex model involving several parameters one cannot turn the
sampling variation of one parameter $\tau$ into a "random effect" of the form
considered here unless the model fits rather narrow structural constraints. Thirdly,
the eligible class of mixing distributions will generally change with the sample
size n of each stratum. Finally, the priors are unconventional. Indeed, the most
important aspect of the proposed models may not be their practicality but the
insight gained by turning the presence of mixing into an observable phenomenon
through the statistic $t(x)$.


2. The reweighted infinitely divisible densities. In this section the basic
relationships between infinitely divisible distributions and exponential family
mixture models are established. First let $P_\beta$ be a family of infinitely divisible
distributions, with $\beta$ a positive-valued parameter, defined by the moment
generating function (m.g.f.) relationship

(2.1)    $\int e^{\tau x}\, dP_\beta(\tau) = e^{\beta t(x)}.$

Note that $t(0) = 0$. We assume $t(x)$ is finite on the sample space of X. Important
examples of this relationship are given by

    Distribution             t(x)
    Normal $(0, \beta)$      $x^2/2$
    Poisson $(\beta)$        $\exp(x) - 1$
    Gamma $(\beta, 1)$       $-\log(1 - x)$

For $k(\tau)$ defined from (1.1) by $k(\tau) = \log \int \exp(\tau x)\, d\mu(x)$, we define the
k-reweighted distributions $P^*_\beta$ by the relationship

(2.2)    $dP^*_\beta(\tau) = c(\beta)\, e^{k(\tau)}\, dP_\beta(\tau),$

where $c(\cdot)$ is the necessary standardizing function; we assume, of course, that the
defining integral is finite, else the distribution is not defined. The following
theorem establishes the main result.


THEOREM 2.1. If $dP^*_\beta$ is defined as in (2.2), then

(2.3)    $\int e^{\tau x - k(\tau)}\, dP^*_\beta(\tau) = c(\beta)\, e^{\beta t(x)}.$

Conversely, suppose (2.3) holds for some function $t(\cdot)$ on a set of x-values which
has a point of accumulation and which includes 0, on an interval of $\beta$-values
with left endpoint 0. Then (2.2) holds for a family of infinitely divisible
distributions with log m.g.f. $\beta(t(x) - t(0))$.


PROOF. Integrating $e^{\tau x - k(\tau)}$ against the measure in (2.2) and using (2.1)
gives (2.3) directly. For the converse, we define from $P^*_\beta$ a family of
probability measures

    $dP_\beta(\tau) = d(\beta)\, e^{-k(\tau)}\, dP^*_\beta(\tau),$

where $d(\beta)$ is the standardizing constant. Note that we are just standardizing by
the inverse of the value of the density (2.3) at x = 0. Hence (2.3) may be
rewritten as

(2.4)    $\int e^{\tau x}\, dP_\beta(\tau) = e^{\beta(t(x) - t(0))}.$

If this relationship holds for a set of x-values with a point of accumulation, then
the right-hand side of (2.4) completely determines the distribution $P_\beta$. It follows
that the distribution $P_\beta$ is the n-fold convolution of the distribution $P_{\beta/n}$.
Hence $P_\beta$ is infinitely divisible and $\beta(t(x) - t(0))$ is the log m.g.f. for it. □


In the next proposition the families of infinitely divisible distributions are 
extended by several parameters. 


PROPOSITION 2.2. (a) Suppose that $\beta t(x)$ is the log m.g.f. of an infinitely
divisible family of probability distributions $P_\beta$. Then

    $t^*(x) = \alpha x + \beta[t(\sigma x + \theta) - t(\theta)]$

is also the log m.g.f. for a family $P^*_\beta$ of infinitely divisible distributions, for
arbitrary choice of real parameters $\alpha$, $\sigma$, and $\theta$, subject to $t(\theta) < \infty$. If the
support of $P_\beta$ is contained in a half-infinite interval, then $P^*_\beta$ has support
shifted to the right by $\alpha$.

(b) If $\beta_1 t_1(x), \ldots, \beta_n t_n(x)$ are each the log m.g.f. for a family of infinitely
divisible distributions, then so is $\beta_1 t_1(x) + \cdots + \beta_n t_n(x)$.


PROOF. First, we construct by exponential family tilt the density

(2.5)    $dP_{\beta,\theta}(\tau) = c(\beta, \theta)\, e^{\theta\tau}\, dP_\beta(\tau) = e^{\theta\tau - \beta t(\theta)}\, dP_\beta(\tau).$

This distribution has the m.g.f. in x:

(2.6)    $e^{\beta(t(x + \theta) - t(\theta))}.$

If Y has this m.g.f., then $\sigma Y + \alpha$ has the log m.g.f. $t^*(x)$ specified in the
proposition, part (a).

Part (b) simply indicates the closure of infinitely divisible distributions under
convolutions. □


In Table 1 Proposition 2.2(a) is used to generate several classes of functions
$t(\cdot)$. Using part (b) one can create multiparameter exponential family mixture
models. In regard to computing the posterior distributions on $\tau$, notice that the
joint density of $(\tau, x)$ is proportional to

    $e^{\tau x - k(\tau)}\, e^{k(\tau)}\, dP_\beta(\tau) = e^{\tau x}\, dP_\beta(\tau),$

so that for each $\beta$ the family of distributions in parameter x is the exponential
family tilt of the distribution $P_\beta$. If $\phi_\beta(s)$ is the moment generating function for
$dP_\beta$, then the m.g.f. for the posterior given x is $\phi_\beta(s + x)/\phi_\beta(x)$. Since the prior
is the expectation of the posterior over the marginal distribution of X, the priors
corresponding to the families in Table 1 are mixtures of normals, Poissons, and
gammas, respectively.

TABLE 1
Three important infinitely divisible distributions and the corresponding mixture structure

Infinitely divisible                            t(x)                                                        Parameter space          Posterior                                                        Posterior mean                                  Posterior variance
distribution                                                                                                $\Omega$ must contain
$N(\alpha, \sigma^2)$                           $\alpha x + \sigma^2 x^2/2$                                 $(-\infty, \infty)$      $N(\alpha + \sigma^2 x, \sigma^2)$                               $\alpha + \sigma^2 x$                           $\sigma^2$
$\alpha + \sigma Y$, $Y \sim Po(\beta)$         $\alpha x + \beta(e^{\sigma x} - 1)$                        $\sigma R^+ + \alpha$    $\alpha + \sigma Y$, $Y \sim Po(\beta e^{\sigma x})$             $\alpha + \sigma\beta e^{\sigma x}$             $\sigma^2 \beta e^{\sigma x}$
$\alpha + \sigma Y$, $Y \sim G(\beta, \lambda)$ $\alpha x + \beta \log[\lambda/(\lambda - \sigma x)]$       $\sigma R^+ + \alpha$    $\alpha + \sigma Y$, $Y \sim G(\beta, \lambda - \sigma x)$       $\alpha + \sigma\beta/(\lambda - \sigma x)$     $\sigma^2 \beta/(\lambda - \sigma x)^2$
                                                ($\sigma x < \lambda$)

Since all infinitely divisible distributions are the weak-convergence limits of
convolutions of generalized Poisson distributions one might consider

(2.7)    $t(x) = \alpha x + \sum_j \beta_j (e^{\sigma_j x} - 1)$

to be a general format for the class of models. This is correct when the X-sample
space is bounded. However, the constraint that $\int \exp(t(x))\, d\mu(x) < \infty$ otherwise
provides some (possibly severe) limitations. Indeed, on an infinite sample space
the possibility of generating an exponential family that models overdispersion
may be considerably reduced by the integrability constraint. The following
proposition, a direct application of Fubini's Theorem, gives us an alternative test
for the eligibility of families $\{P_\beta\}$.


PROPOSITION 2.3.

    $\int e^{\beta t(x)}\, d\mu(x) < \infty$ if and only if $\int e^{k(\tau)}\, dP_\beta(\tau) < \infty$.
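
The interchange of integrals behind Proposition 2.3 can be written out in one line per step. The display below is a sketch of that argument using only the defining relationship (2.1) and the definition of $k(\tau)$; since the integrand is nonnegative, the interchange is valid whether or not the integrals are finite.

```latex
\begin{align*}
\int e^{\beta t(x)}\, d\mu(x)
  &= \int\!\!\int e^{\tau x}\, dP_\beta(\tau)\, d\mu(x)
     && \text{by (2.1)} \\
  &= \int\!\!\int e^{\tau x}\, d\mu(x)\, dP_\beta(\tau)
     && \text{by Fubini (nonnegative integrand)} \\
  &= \int e^{k(\tau)}\, dP_\beta(\tau)
     && \text{since } k(\tau) = \log \int e^{\tau x}\, d\mu(x).
\end{align*}
```

Both sides are therefore finite or infinite together.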


EXAMPLES. From Proposition 2.3 for the normal $(\theta, 1)$ distribution one has
the requirement on $P_\beta$ that

    $\int \exp(\tau^2/2)\, dP_\beta(\tau) < \infty.$

From this it is clear that the tails of $dP_\beta(\tau)$ must decline faster than
$\exp(-\tau^2/2)$, and so no generalized Poisson will have a convergent integral. We
note, however, that in some problems a truncation of the sample space may not be
unreasonable, in which case there is no difficulty with convergence for the
corresponding truncated exponential family.

For the negative exponential density, $\tau \exp(-\tau x)$, Proposition 2.3 yields the
requirement that $\int \tau^{-1}\, dP_\beta(\tau) < \infty$. Since this will hold whenever there is no
support at 0, correspondingly there is a rich class of exponential family mixture
models, with an interesting contrast to the arbitrary mixture density.

PROPOSITION 2.4. Let $f(x; \tau) = \tau e^{-\tau x}$ for $x \ge 0$, $\tau > 0$. Then if f(x) is a
density on $x \ge 0$, we have

(a) f(x) is a mixture of $f(x; \tau)$ densities if and only if f(x) is completely
monotone.

(b) If $f(x) = c(\beta)\exp(-\beta\psi(x))$ for $\beta > 0$ is a mixture of $f(x; \tau)$ densities with
$f(0) < \infty$, then $\beta\psi'(x) = D_x[-\log f(x)]$ (the posterior mean of $\tau$) is completely
monotone.

(c) Conversely, if $\psi(x)$ has a completely monotone derivative $\psi'$ on $(0, \infty)$,
then there exists an exponential mixture density

    $f(x; \alpha, \beta) = c(\alpha, \beta)\exp(-\alpha x - \beta\psi(x)) = \int f(x; \tau)\, dP^*_{\alpha,\beta}(\tau)$

for $\alpha > 0$, $\beta > 0$.


PROOF. (a) Feller (1971), page 464, #11.
(b) Follows from Feller (1971), page 450, Theorem 1. □

3. Exploratory analysis. As mentioned in Section 1, the form of the
exponential family model suggests that an exploration of the mixture structure of
a data set might potentially be conducted by analysis of the logarithmic residuals

    $r(x) = \log[\hat p(x)/f(x; \hat\tau)].$

In this section a data set is used to verify this idea, with emphasis on two
attributes: (a) a graphical analysis illustrates important aspects of structure and
(b) the problem of finding an appropriate mixture model is similar to that of
finding an appropriate higher order term in a regression model. The example will
also serve to illustrate several potential limitations on this simple approach which
arise from the unbounded nature of the sample space.


FIG. 1. Graph of logarithmic residuals r(x) and three fitted models discussed in text.
(Vertical axis: log residuals; horizontal axis: x = 0, ..., 7.)


TABLE 2
Observed and estimated counts of the number of accidents for 9461 drivers

x     Observed count    $nf_1$     $nf_2$     Negative binomial
0     7840              7846.81    7852.20    7846.93
1     1317              1298.18    1265.93    1288.48
2     239               238.60     259.42     256.52
3     42                54.44      52.41      54.05
4     14                15.12      15.88      11.70
5     4                 4.90       3.97       2.58
6     4                 1.77       0.95       0.57
7     1                 0.69       0.21       0.13
8+    0                 0.47       0.05       0.04

A graphical presentation of the number of accidents in a year of driving by
9461 drivers in Belgium is given in Figure 1. The data appear in Table 2. They
were given by Thyrion (1961) and analyzed further by Seal (1971),
Simar (1976), and Lambert and Tierney (1984). Assuming the number of accidents
in a homogeneous population is Poisson, there is good reason to model the
observed distribution as a mixture of Poissons. In Figure 1 the residuals r(x) are
plotted against x, revealing the convex shape characteristic of a mixed
distribution [see (1.4)]. Also plotted on Figure 1 is the maximum likelihood fit of
r(x) by the negative binomial density, which is the mixture model arising from the
conjugate mixing distribution. One can see that it appears to describe the tail
behavior of the empirical density $\hat p(x)$ inadequately.
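
The logarithmic residuals plotted in Figure 1 can be recomputed from the observed counts. The sketch below is a minimal illustration, assuming the observed counts for x = 0, ..., 7 as listed in Table 2 with no drivers at 8 or more, and using the sample mean as the Poisson maximum likelihood estimate; the function name `log_residuals` is hypothetical.

```python
import math

# Observed accident counts for x = 0, ..., 7; the total is 9461 drivers.
counts = [7840, 1317, 239, 42, 14, 4, 4, 1]

def log_residuals(counts):
    """Return the Poisson MLE of the mean and r(x) = log[p_hat(x)/f(x; tau_hat)]."""
    n = sum(counts)
    lam = sum(x * c for x, c in enumerate(counts)) / n   # sample mean
    residuals = []
    for x, c in enumerate(counts):
        # Log of the fitted Poisson probability at x.
        log_pois = -lam + x * math.log(lam) - math.lgamma(x + 1)
        residuals.append(math.log(c / n) - log_pois)
    return lam, residuals

lam_hat, r = log_residuals(counts)
for x, value in enumerate(r):
    print(x, round(value, 3))
```

Plotting `r` against x gives the kind of diagnostic graph described in the text; residuals are only defined at values of x with a positive observed count.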

The convergence criterion of Proposition 2.3 for the Poisson f with mean $\lambda$
and natural parameter $\tau = \log\lambda$ is

(3.1)    $\int \exp(e^{\tau})\, dP_\beta(\tau) < \infty.$

This converges for every $dP_\beta(\tau)$ of the generalized Poisson form $\alpha - \sigma X$ for
which $\sigma$ is positive and X is Poisson $(\beta)$. This implies that there exist
exponential family mixture densities of the form

(3.2)    $\log f(x; \alpha, \beta) = \alpha x + \beta e^{-\sigma x} - k(\alpha, \beta) - \log x!$

for $\sigma$ positive.

The shape of the r(x) graph suggests a reasonable fit could be obtained from a
function $t(x)$ with decreasing second derivative and so class (3.2) seems
promising. Figure 1 also shows the fit of exponential family Poisson mixtures for
two choices of $\sigma$ in (3.2):

    #1: $\log f_1(x) = \alpha x + \beta \exp(-0.25x) - k(\alpha, \beta) - \log x!$,
    #2: $\log f_2(x) = \alpha x + \beta \exp(-0.50x) - k(\alpha, \beta) - \log x!$.

(The choice of $\sigma$ is a delicate issue which will be discussed in a remark at the end
of this section.) From the parameter estimates $(\hat\alpha, \hat\beta)$ for the two exponential
models one can directly compute estimates of the posterior mean $E[\log\lambda \mid x]$ and
variance $V(\log\lambda \mid x)$ via (1.4).

The parameter estimates used to fit these exponential models were obtained
by minimizing over $(\alpha, \beta, k)$ the weighted sum of squares

(3.3)    $\sum_x \hat p(x)\,(r(x) - \alpha x - \beta t(x) + k)^2.$

One nice feature of these estimators is that, unlike maximum likelihood
estimation, one does not need to compute the summation constant $k(\alpha, \beta)$; one can
estimate it and thereby eliminate the need for iterative methods. What is
surprising is that in a discrete problem with finite support these estimates are
fully efficient. This is shown in the next section. Of course, the term $k(\hat\alpha, \hat\beta)$ must
be computed terminally in order to correctly standardize the fitted density.
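
The weighted least-squares fit with an estimated intercept can be sketched as an ordinary weighted regression. The example below is an illustration under stated assumptions rather than the author's computation: it uses the observed counts for x = 0, ..., 7, fixes σ = 0.25 as in model #1, regresses $\log[\hat p(x)\, x!]$ on x and $e^{-\sigma x}$ with weights $\hat p(x)$ (equivalent to regressing r(x), since the two responses differ by a linear function of x), and then evaluates the exact standardizing constant by summing far into the tail. All variable names are hypothetical.

```python
import math
import numpy as np

counts = np.array([7840, 1317, 239, 42, 14, 4, 4, 1], dtype=float)  # x = 0..7
x = np.arange(len(counts), dtype=float)
p_hat = counts / counts.sum()
sigma = 0.25                                  # model #1, treated as fixed
t = np.exp(-sigma * x)
log_h = -np.array([math.lgamma(v + 1) for v in x])   # h(x) = 1/x!

# Weighted regression of log[p_hat(x)/h(x)] on (x, t(x), -1) with weights p_hat(x);
# the coefficient of the -1 column is the estimated standardizing constant k_hat.
y = np.log(p_hat) - log_h
design = np.column_stack([x, t, -np.ones_like(x)])
sw = np.sqrt(p_hat)
coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
alpha_hat, beta_hat, k_hat = coef

# Exact standardizing constant k(alpha_hat, beta_hat) via a log-sum-exp over 0..199.
xs = np.arange(0, 200, dtype=float)
terms = (alpha_hat * xs + beta_hat * np.exp(-sigma * xs)
         - np.array([math.lgamma(v + 1) for v in xs]))
k_exact = float(terms.max() + np.log(np.sum(np.exp(terms - terms.max()))))

delta = k_exact - k_hat                       # shift between the two fits
weighted_resid_sum = float(np.sum(p_hat * (y - design @ coef)))
print(alpha_hat, beta_hat, k_hat, delta)
```

The quantity `delta` is the gap between the computed and estimated standardizing constants; it is nonnegative by construction of the fit, and a small value indicates that the estimated intercept standardizes the density well.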

For comparison purposes Pearson's chi-square goodness-of-fit statistic for these
models was computed by grouping all data from 5 on. The values for $f_1$, $f_2$, and
the negative binomial were 3.38, 13.50, and 14.69, respectively. Even allowing a
degree of freedom for the selection of $\sigma$, the density $f_1$ is a superior fit.

Of course, this Poisson example does not have a finite sample space, and the
use of the least-squares method here illustrates some limitations in this approach.
Since the value of the magnitude factor $k(\alpha, \beta)$ is estimated, it is primarily the
shape of the density being fit. There is no penalty for a poor fit in regions where
$\hat p(x) = 0$. In the example, $\hat p(x) = 0$ for $x \ge 8$, and in particular, if one used
model (3.2) with $\sigma = -0.10$, then the solution is inadequate. It gives a bimodal
density with the larger mode lying beyond the range of the data.

Information about the appropriateness of the least-squares fit can be obtained
by evaluating $\Delta = k(\hat\alpha, \hat\beta) - \hat k$. This is the shift on the logarithmic scale between
the least-squares fit and the correctly standardized fit for each x. Since it will be
shown in Proposition 4.4 that $\Delta$ is also the Kullback-Leibler information
distance between the empirical density and the restandardized fit, it follows that
if $\Delta$ is small, the observed data are sure to be close to the model. In this regard
Proposition 4.4 implies that $2n\Delta$ would be the likelihood ratio goodness-of-fit
statistic against the general multinomial alternative except that the least-squares
estimators of $(\alpha, \beta)$ are substituted for the maximum likelihood estimators. Since
$2n\Delta$ is therefore greater than the likelihood ratio statistic, it provides a
conservative test of the fit of the model. In the above models #1 and #2 the
differences $\Delta$ were quite small, being 0.0003 and 0.0009, respectively. The shift in
log fit is negligible on the scale of Figure 1.


REMARK. When $\sigma$ is treated as a free parameter, as was done implicitly
above, the resulting family of distributions is no longer an exponential family and
the method of least squares becomes a nonlinear problem in the parameter $\sigma$. The
method of least squares could still be applied to choose $\sigma$, but in the given
example the unbounded sample space makes this fail, as the solution falls in the
range where least squares gives a poor solution. Thus if the statistical user were
to need more than the above results, a more precise maximum likelihood solution
is called for. We do note, however, what has been gained by the exploratory
analysis: an idea of the data structure along with a good set of initial parameter
estimates.


4. Least-squares estimators. In this section it is shown that certain
least-squares estimators are fully efficient estimators for finite discrete exponential
families. Two theorems are given; in the first, the standardizing constant for the 
density must be computed, in the other it is estimated. Conceptually, the first is 
more attractive, as it forces the correctly standardized density to lie near the 
empirical density; but the second least-squares estimator can be computed 
directly by ordinary weighted regression, and when the model fits the data well 
the standardization seems to be well estimated, as was seen above. 

This section is not meant to be a complete study of the properties of 
least-squares estimators for discrete exponential families; indeed the theorems 
herein raise further questions which will be addressed elsewhere. The preliminary 
results are offered here merely to indicate that the exponential family models 
being considered are not as computationally unattractive as one might have 
supposed. 


THEOREM 4.1. Suppose that $\theta = (\theta_1, \ldots, \theta_m)$, $t(x) = (t_1(x), \ldots, t_m(x))$, and
that

    $f_\theta(x) = \exp(\theta \cdot t(x) - k(\theta))\, h(x)$

is a discrete exponential family density with respect to counting measure on a
finite support set $\{x_1, \ldots, x_s\}$ on which the functions $1, t_1(x), \ldots, t_m(x)$ are
linearly independent. Then the vector $\hat\theta$ which minimizes

    $R(\theta) = \sum_x \hat p(x)\,\{\log[\hat p(x)/h(x)] - \theta \cdot t(x) + k(\theta)\}^2$

is an asymptotically efficient estimator of $\theta$.


PROOF. This is an element of a type of estimating equation considered by
Rao (1961). Define the probability estimates for each x by $\hat p(x) = n(x)/n$. Then
the estimating equation is of the form

    $\sum_x \hat p(x)\,(\log \hat p(x) - \log f_\theta(x))\,(f_\theta'(x)/f_\theta(x)) = 0.$

It is easily shown to satisfy Lemma 3 of Rao, and hence is first-order efficient.
(Rao's lemma is for univariate $\theta$, but is easily extended.) □


REMARK. We note that for n = 1, the solution is the maximum likelihood
estimator. Moreover, one can apply the results of Rao concerning second-order
efficiency to show that this estimator of $\theta$ has the same second-order efficiency as
the minimum chi-square estimator and the minimum Kullback-Leibler distance
estimator.

THEOREM 4.2. Under the same conditions as Theorem 4.1 the component $\hat\theta$
of the vector $(\hat\theta, \hat k)$ which minimizes

    $R^*(\theta, k) = \sum_x \hat p(x)\,\{\log[\hat p(x)/h(x)] - \theta \cdot t(x) + k\}^2$

is an asymptotically efficient estimator of $\theta$.


PROOF. Although this could again be proved using Rao [see Remark (1)
below], a direct proof offers some insight. Consider univariate $\theta$. The solution in
$\theta$ to the weighted least-squares problem may be obtained by first regressing out the
constant term. For univariate $\theta$ this gives

    $\hat\theta = \dfrac{\sum_x \hat p(x)\,\{\log[\hat p(x)/h(x)]\}\,(t(x) - \bar t)}{\sum_x \hat p(x)\,(t(x) - \bar t)^2},$

where $\bar t = \sum_x \hat p(x)\, t(x)$. Moreover,

    $\sqrt{n}\,(\hat\theta - \theta_0) = \dfrac{\sqrt{n}\,\sum_x \hat p(x)\,\{\log \hat p(x) - \log f_{\theta_0}(x)\}\,(t(x) - \bar t)}{\sum_x \hat p(x)\,(t(x) - \bar t)^2}.$

As $n \to \infty$, the denominator converges in probability to Var T. If we view the
numerator of $\hat\theta - \theta_0$ as a function of $\hat p$, say $g(\hat p)$, and use the delta method to
find its limiting distribution about $\hat p = f_{\theta_0}$, we obtain the result. □










By the delta method, $k(\hat\theta)$ is a fully efficient estimator of $k(\theta_0)$. The following
corollary shows that the least-squares estimator $\hat k$ is also.


COROLLARY 4.3. The least-squares estimator $\hat k$ of the standardizing constant
$k(\theta_0)$ is asymptotically efficient.


PROOF. We write

    $\hat k = -\sum_x \hat p(x) \log[\hat p(x)/h(x)] + \hat\theta\,\bar t$

and so

    $\hat k - k_0 = (\hat\theta - \theta_0)\,\bar t - \sum_x \hat p(x)\,(\log \hat p(x) - \log f_{\theta_0}(x)).$

The second term has, when multiplied by $-2n$, an asymptotic $\chi^2(s - 1)$
distribution, hence

    $\sqrt{n}\,(\hat k - k_0) = \sqrt{n}\,(\hat\theta - \theta_0)\,\bar t + o_p(1),$

which implies the limiting distribution $N(0, E^2(T)/\mathrm{Var}(T))$, thus achieving the
required lower bound for the variance of an estimator of k. □
We define the Kullback-Leibler information distance K(p, f) between two
discrete densities by

    $K(p, f) = \sum_x p(x) \log[p(x)/f(x)].$

The following proposition establishes some structural properties of the weighted
least-squares solution with regard to this distance.


PROPOSITION 4.4. (i) The least-squares estimators $(\hat\alpha, \hat\beta, \hat k)$ satisfy

    $\hat k \le \log\Big\{\sum_{x:\, \hat p(x) > 0} h(x)\, e^{\hat\alpha x + \hat\beta t(x)}\Big\} \le k(\hat\alpha, \hat\beta).$

(ii) The Kullback-Leibler information distance between $\hat p(x) = n(x)/n$ and
the density $\hat f = f(x; \hat\alpha, \hat\beta)$ is

    $K(\hat p, \hat f) = k(\hat\alpha, \hat\beta) - \hat k.$


PROOF. Let $k^*(\hat\alpha, \hat\beta)$ be the middle term in the above inequalities. We have
from the least-squares equations (with intercept) that the weighted residuals sum
to zero:

(4.1)    $\sum_x \hat p(x)\,[\log \hat p(x) - \log \tilde f(x)] = 0,$

where $\tilde f(x)$ is the fitted value using $\hat k$. However, the information inequality for
probability mass functions implies

(4.2)    $\sum_x \hat p(x)\,[\log \hat p(x) - \log f^*(x)] \ge 0,$

where $f^*(x)$ is any probability mass function with support inclusive of that of
$\hat p(x)$. In particular, $f^*(x) = \exp(\hat k - k^*(\hat\alpha, \hat\beta))\,\tilde f(x)$ is a density on the range
$\{x: \hat p(x) > 0\}$. Using $f^*$ in (4.2), together with (4.1), gives result (i) of the
proposition. Use $\hat f(x) = \exp(\hat k - k(\hat\alpha, \hat\beta))\,\tilde f(x)$ on the full range to get (ii). □


REMARKS. (1) One can generalize Theorem 4.2 to the problem of
estimating $\theta$ in $f_\theta(x) = g_\theta(x)\exp(-k(\theta))$, where $g_\theta(x)$ is some positive function
of $\theta$ and x and $\exp(k(\theta)) = \sum_x g_\theta(x)$. Then the problem

    $\min \sum_x \hat p(x)\,(\log \hat p(x) + k - \log g_\theta(x))^2$

becomes after minimization over k

    $\min \sum_x \hat p(x)\,\Big(\log \hat p(x) - \log g_\theta(x) - \sum_y \hat p(y)\log \hat p(y) + \sum_y \hat p(y)\log g_\theta(y)\Big)^2.$

Again, using an adapted version of Rao's Lemma 3, one can demonstrate full
first-order efficiency for this functional of $\hat p$.

(2) Although weighted least-squares estimators have been used in various
guises in the statistical literature for discrete data analysis [e.g., Grizzle, Starmer,
and Koch (1969)], to the author's knowledge this is the first explicit recognition
that the standardizing constant can be treated as an unknown intercept. Its use
seems to be implicitly advocated in Gabriel and Zamir (1979).

(3) A small computer experiment substantiated that the least-squares
estimators can have reasonable efficacy in small samples. Samples of size 10 and 20
were taken from the binomial (2, 0.75) distribution. Expected counts of (0, 1, 2)
were therefore (0.625, 3.75, 5.625) and (1.25, 7.5, 11.25), respectively. Since the
maximum likelihood estimator of $\theta = \log(p/(1 - p))$ is infinite on some points of the
sample space, and the least-squares estimator is undefined on some points, we
cannot precisely define mean square error here. However, in 362 Monte Carlo
repetitions of the sample size 10 experiment (the trials terminated at the first
"bad" sample) the sample mean square error of the m.l.e. was 0.292 and the mean
square error of the least-squares estimator was 0.338. Note that the asymptotic
variance for this problem is 0.266. In the second Monte Carlo experiment, after
1150 iterations the sample mean square errors were 0.139 and 0.130 for the m.l.e.
and least-squares estimators, respectively (cf. asymptotic variance 0.133).
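
The two estimators compared in this experiment can be written down directly for the binomial (2, p) case. The sketch below is illustrative only and does not reproduce the Monte Carlo runs: it assumes counts of x = 0, 1, 2 are supplied as a length-3 sequence with all entries positive, takes h(x) to be the binomial coefficient C(2, x) and t(x) = x, and uses the univariate weighted regression slope for the least-squares estimator. The function names are hypothetical.

```python
import math
import numpy as np

def mle_logit(counts):
    """Maximum likelihood estimate of theta = log(p/(1-p)) from counts of x = 0, 1, 2."""
    n = sum(counts)
    p = (counts[1] + 2 * counts[2]) / (2 * n)
    return math.log(p / (1 - p))

def ls_logit(counts):
    """Weighted least-squares estimate: slope of log[p_hat(x)/h(x)] on x,
    with intercept and weights p_hat(x). Requires every count to be positive."""
    counts = np.asarray(counts, dtype=float)
    p_hat = counts / counts.sum()
    x = np.array([0.0, 1.0, 2.0])
    h = np.array([1.0, 2.0, 1.0])             # binomial coefficients C(2, x)
    y = np.log(p_hat / h)
    x_bar = np.sum(p_hat * x)
    return float(np.sum(p_hat * y * (x - x_bar)) / np.sum(p_hat * (x - x_bar) ** 2))

def asymptotic_variance(n, p=0.75):
    """Inverse Fisher information for theta from n binomial(2, p) observations."""
    return 1.0 / (n * 2 * p * (1 - p))

# Counts proportional to the exact binomial(2, 0.75) probabilities (1/16, 6/16, 9/16).
example = [1, 6, 9]
print(mle_logit(example), ls_logit(example), math.log(3))
print(asymptotic_variance(10), asymptotic_variance(20))
```

When the observed proportions match a binomial distribution exactly, both estimators return the same value of θ; they differ only through sampling variation in the counts.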


5. Concluding remarks. Some useful comparisons can be made between the
mixture dispersion models of this paper and the dispersion models used in the
generalized linear models of McCullagh and Nelder (1983). In the univariate-x
case the quasi-likelihood methods that they develop are exactly maximum
likelihood methods when the underlying density has the form (McCullagh, 1983)

(5.1)    $f(x; \theta, \sigma^2) = h(x)\exp(\sigma^{-2}(\theta x - b(\theta)) - c(\sigma^2, x)).$

Here $\sigma^2$ represents a dispersion parameter, and for comparison purposes think of
$\sigma^2 = 1$ as being the basic exponential family model. The practical advantages of
this structure are extremely important: the mean of X and indeed the likelihood
equations in the $\theta$-parameter do not involve $\sigma^2$, so regression modelling in the
mean value of X is straightforward. Moreover, since (as will be shown) $\sigma^2$ is
effectively a sample size parameter, the model easily accommodates observations
which are themselves means of varying sample sizes. The exponential families
discussed earlier in this paper do not in general have these nice modelling
properties, being directly applicable only to a random sample from one
population.

On the other hand, the model given by (5.1) is only a single possible
representation of dispersion. In fact it is one generated by convolutions rather than
mixtures, as we now demonstrate.

The moment generating function for (5.1) in variable s is

(5.2)    $\phi(s) = e^{\sigma^{-2}(b(\theta + s) - b(\theta))}.$

For (5.2) to be a moment generating function for all $\sigma^2 > 0$ implies that it is
infinitely divisible; it is in fact [as can be seen from (2.6)] the exponential family
tilt in $\theta$ of an infinitely divisible distribution with convolution parameter
$\beta = \sigma^{-2}$.

Thus one cannot in general create a model of the form (5.1) to represent
dispersion about a baseline exponential model. However, (5.2) will represent a
true moment generating function for all positive integer values of $\sigma^{-2}$; these are
just the convolutions of the baseline density. Thus one could interpret $\sigma^{-2}$ in
(5.1) as being the unknown sample size of the sample from which the observation,
the sample mean $\bar X$, was taken.


Gaussian Markov distributions are characterised by zeros in the inverse
of their covariance matrix and we describe the conditional independencies
which follow from a given pattern of zeros. Describing Gaussian distributions
with given marginals and solving the likelihood equations with covariance
selection models both lead to a problem for which we present two cyclic
algorithms. The first generalises a published algorithm for covariance
selection whilst the second is analogous to the iterative proportional scaling of
contingency tables. A convergence proof is given for these algorithms and this
uses the notion of I-divergence.


1. Introduction. Most modelling of jointly Gaussian (normal) random
variables involves the specification of a structure on the mean and the covariance
matrix K. However, models which specify structure on $K^{-1}$ have also been
developed, although they are seemingly less popular. Our interest in this paper
focuses on the covariance selection models, introduced by Dempster (1972) and
studied by Wermuth (1976a, b), in which certain elements of $K^{-1}$ are assumed to
be zero.

In Section 2 we show how zeros in $K^{-1}$ correspond to conditional independence
statements and characterise all such statements consequent upon a given
pattern of zeros. The characterisation is achieved by associating a simple graph
[Behzad et al. (1979)] with the elements of $K^{-1}$ and providing rules for reading
the graph. The results are a direct analogue of those given in Darroch et al. (1980)
for contingency table models; see also Speed (1979).

The likelihood equations for covariance selection models lead naturally to a 
consideration of the problem of finding Gaussian distributions with prescribed 
margins. The results in Sections 3 and 4 provide a solution to this problem and a 
general algorithm for constructing the required distributions is given. Two special 
cases of this algorithm are considered. The first one is a generalisation of an 
algorithm in Wermuth and Scheidt (1977) whilst the second one has properties 
analogous to iterative proportional scaling for contingency tables [Haberman 
(1974)]. The notion of I-divergence [Csiszár (1975)], or discrimination information 
in the terminology of Kullback (1959), plays an important role in the convergence 
proof of this algorithm. 

Finally, in Section 5 we show how the I-divergence geometry of Csiszár (1975) 
provides a framework in which both algorithms can be seen to be an iterated 
sequence of I-projections. 
2. Conditional independence for Gaussian random variables. In the 
following we consider a random vector $X$ having a Gaussian distribution with 
mean 0 and positive definite covariance matrix $K$. The components of $X$ will be 
indexed by a finite set $C$ and for $a \subseteq C$ we write $X_a$ for the subset of the 
components of $X$ indexed by $a$, namely $(X_\gamma: \gamma \in a)$. The covariance matrix 
$K = (K(\alpha, \beta): \alpha, \beta \in C)$ on $C$ is defined by $K(\alpha, \beta) = E\{X_\alpha X_\beta\}$, $\alpha, \beta \in C$, 
where $E$ denotes expected value. For subsets $a, b \subseteq C$, $K_{a,b} = \{K(\alpha, \beta): \alpha \in a, \beta \in b\}$ 
denotes the cross covariance matrix of $X_a$ and $X_b$. When $a = b$ we 
write $K_a$ instead of $K_{a,a}$. Note that care must be taken to distinguish between 
$K_a^{-1}$ and $(K^{-1})_a$. The density $p(x)$ of $X$ is, of course, 


(1) $p(x) = (2\pi)^{-|C|/2} (\det K)^{-1/2} \exp\{-\tfrac{1}{2} x' K^{-1} x\}, \qquad x \in R^{|C|},$ 


where $|\cdot|$ denotes the cardinality of the argument. Marginal densities are 
subscripted by their defining sets, e.g., $p_a(x_a)$, or simply $p_a$, refers to the marginal 
density of $X_a$, where $a$ is an arbitrary subset of $C$. 

Proposition 1 relates the conditional independence of two components of $X$ to 
the structure of $K$. In the proposition and following we abbreviate the set 
intersection $a \cap b$ to $ab$ and write $a \setminus b$ for the complement of $b$ in $a$. The set 
$C \setminus b$ will be denoted $b'$. 


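PROPOSITION 1. For subsets $a$, $b$ of $C$ with $a \cup b = C$ the following statements 
are equivalent. 


(i) $K_{a,b} = K_{a,ab} K_{ab}^{-1} K_{ab,b}$; 
(i') $K_{a \setminus b,\, b \setminus a} = K_{a \setminus b,\, ab} K_{ab}^{-1} K_{ab,\, b \setminus a}$; 
(ii) $(K^{-1})_{a \setminus b,\, b \setminus a} = 0$; 
(iii) $X_a$ and $X_b$ are conditionally independent given $X_{ab}$. 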


PROOF. (i) and (i') are easily seen to be equivalent by partitioning the rows of 
$K$ over $a \setminus b$ and $ab$ and the columns over $b \setminus a$ and $ab$. By partitioning over 
$a \setminus b$, $b \setminus a$, and $ab$, a straightforward use of the expression for the inverse of a 
partitioned matrix [Rao (1973, page 33)] proves that (i) is equivalent to (ii). The 
standard formula (2) for the conditional covariance matrix gives the connection 
between (iii) and (i'), 


(2) $\mathrm{cov}(X_{a \setminus b}, X_{b \setminus a} \mid X_{ab}) = K_{a \setminus b,\, b \setminus a} - K_{a \setminus b,\, ab} K_{ab}^{-1} K_{ab,\, b \setminus a}.$ □ 


A useful special case of the above proposition is the following corollary, given 
by Wermuth (1976a). 


COROLLARY 1. For distinct elements $\alpha$, $\beta$ of $C$, $X_\alpha$ and $X_\beta$ are conditionally 
independent given $X_{\{\alpha, \beta\}'}$ iff $K^{-1}(\alpha, \beta) = 0$. 


PROOF. Put $a = C \setminus \{\alpha\} = \{\alpha\}'$ and $b = \{\beta\}'$ in Proposition 1. □ 
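
Corollary 1 can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical $4 \times 4$ inverse covariance matrix (not taken from the paper) with a single zero in position (0, 3); the helper name `conditional_cov` is ours, and it evaluates the conditional covariance formula (2). 

```python
import numpy as np

def conditional_cov(K, a, b, d):
    """cov(X_a, X_b | X_d) = K[a,b] - K[a,d] K[d,d]^{-1} K[d,b], as in formula (2)."""
    K = np.asarray(K, dtype=float)
    K_ab = K[np.ix_(a, b)]
    K_ad = K[np.ix_(a, d)]
    K_dd = K[np.ix_(d, d)]
    K_db = K[np.ix_(d, b)]
    return K_ab - K_ad @ np.linalg.solve(K_dd, K_db)

# Hypothetical inverse covariance matrix (diagonally dominant, hence positive
# definite) whose only zero off-diagonal element is in position (0, 3).
precision = np.array([
    [2.0, 0.5, 0.3, 0.0],
    [0.5, 2.0, 0.4, 0.6],
    [0.3, 0.4, 2.0, 0.5],
    [0.0, 0.6, 0.5, 2.0],
])
K = np.linalg.inv(precision)

# X_0 and X_3 given the rest: zero, because precision[0, 3] == 0.
partial_03 = conditional_cov(K, [0], [3], [1, 2])[0, 0]
# X_0 and X_1 given the rest: nonzero, because precision[0, 1] != 0.
partial_01 = conditional_cov(K, [0], [1], [2, 3])[0, 0]
```

The conditional covariance of a pair given all remaining variables vanishes exactly when the corresponding element of $K^{-1}$ is zero, which is the content of the corollary. 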


Having shown that zeros in $K^{-1}$ correspond to conditional independence 
statements we now describe all such statements which follow from a given 
pattern of zeros in $K^{-1}$. To do this we associate a simple undirected graph with 
the pattern of zeros and then give rules for reading the graph to obtain the 
independence relations. 

To begin, some graph-theoretic notation and definitions are needed; for a 
general reference see Behzad et al. (1979). Our simple undirected graph will be 
denoted by $C = (C, E(C))$, the same letter serving for the graph and for its vertex 
set $C$, with $E(C)$ the edge set, which consists of unordered pairs of distinct 
vertices. Pairs of vertices $\{\alpha, \beta\} \in E(C)$ are said to be adjacent. A maximal set 
of ($\geq 2$) vertices for which every pair is adjacent is called a clique. For any vertex 
$\gamma$ we write $\partial\gamma = \{\alpha: \{\alpha, \gamma\} \in E(C)\}$ for the set of neighbours of $\gamma$. We also 
write $\bar\gamma = \gamma \cup \partial\gamma$. 

An important notion is the separation of sets of vertices in $C$. To define this we 
first need to define a chain, which is a sequence $\alpha = \gamma_0, \gamma_1, \ldots, \gamma_m = \beta$ of vertices 
such that $\{\gamma_l, \gamma_{l+1}\} \in E(C)$ for $l = 0, 1, \ldots, m - 1$. If $\gamma_0 = \gamma_m$ the chain is called 
a cycle. Two sets of vertices $a$, $b$ are said to be separated by a third set $d$ if every 
chain connecting an $\alpha \in a$ to a $\beta \in b$ intersects $d$. 

The graph $C$ is said to be triangulated [see Lauritzen et al. (1984)] iff all cycles 
$\gamma_0, \gamma_1, \ldots, \gamma_p = \gamma_0$ of length $p \geq 4$ possess a chord, where a chord is an edge 
connecting two nonconsecutive vertices of the cycle. 

Finally, the graph $\bar C$ complementary to $C$ has vertex set $C$ and edge set $E(\bar C)$ 
with the property that $\{\alpha, \beta\} \in E(\bar C)$ iff $\alpha \neq \beta$ and $\{\alpha, \beta\} \notin E(C)$. Example 1 
illustrates these ideas. 


EXAMPLE 1. The graph C with vertex set {1,2,3,4} and edge set {{1,2}, 
{1,3}, {1,4}, {2,3}, {3,4}} could be depicted as in Figure 1. For this graph the 
set of neighbours of 1 is {2,3,4}; the cliques are {1,2,3}, {1,3,4}; a chain from 
{2} to {4} is 2,3,1,4 and {2} is separated from {4} by {1,3}. Figure 2 shows the 
complementary graph. 

As it stands the graph in Figure 1 is triangulated. However, if the edge {1,3} 
were removed we would have the simplest example of a nontriangulated graph. 
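
The statements of Example 1 can be reproduced mechanically. The sketch below, in plain Python, uses helper names of our own choosing (`neighbours`, `cliques`, `separated`) and a brute-force clique search that is only sensible for very small graphs. 

```python
from itertools import combinations

# The graph of Example 1.
vertices = {1, 2, 3, 4}
edges = {frozenset(e) for e in [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]}

def neighbours(v):
    """Set of vertices adjacent to v."""
    return {u for u in vertices if frozenset((u, v)) in edges}

def is_complete(s):
    """True if every pair of vertices in s is adjacent."""
    return all(frozenset(p) in edges for p in combinations(s, 2))

def cliques():
    """Maximal complete sets with at least two vertices (brute force)."""
    complete = [set(s) for r in range(2, len(vertices) + 1)
                for s in combinations(sorted(vertices), r) if is_complete(s)]
    return [s for s in complete if not any(s < t for t in complete)]

def separated(a, b, d):
    """True if every chain from a vertex of a to a vertex of b meets d."""
    frontier = list(set(a) - set(d))
    seen = set(frontier)
    while frontier:
        v = frontier.pop()
        if v in b:
            return False  # reached b without passing through d
        for u in neighbours(v) - set(d) - seen:
            seen.add(u)
            frontier.append(u)
    return True

# Edge set of the complementary graph.
complement_edges = {frozenset(p) for p in combinations(sorted(vertices), 2)} - edges
```

Here `separated` searches for a chain that avoids $d$; the sets are separated precisely when no such chain exists. 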


The characterisation of all conditional independence relations consequent 
upon a given pattern of zeros in $K^{-1}$ is presented in Proposition 2. 


PROPOSITION 2. Let $C$ be a simple graph with vertex set $C$ indexing the 
Gaussian random variables $X$. Then the following are equivalent. 


(i) $K^{-1}(\alpha, \beta) = 0$ if $\{\alpha, \beta\} \notin E(C)$ and $\alpha \neq \beta$; 

The local Markov property: 
(ii) For every $\gamma \in C$, $X_\gamma$ and $X_{(\bar\gamma)'}$ are conditionally independent given $X_{\partial\gamma}$; 

The global Markov property: 
(iii) For every $a$, $b$ and $d$ with $d$ separating $a$ from $b$ in $C$, $X_a$ and $X_b$ are 
conditionally independent given $X_d$. 


[FIG. 1 and FIG. 2: the graph of Example 1 and its complementary graph, on the vertices 1, 2, 3, 4.] 


PROOF. To show the equivalence of (i) and (ii) we note that (i) is equivalent 
to $K^{-1}(\gamma, (\bar\gamma)') = 0$ for every $\gamma \in C$. Putting $a = \bar\gamma$ and $b = \{\gamma\}'$ in Proposition 1 
then proves the result. 

The equivalence of (i) and (iii) for the case $a \cup b \cup d = C$ follows in a similar 
way if we put “$a$” $= a \cup d$ and “$b$” $= b \cup d$ in Proposition 1. When $a \cup b \cup d \neq C$ 
a simple maximality argument as in Vorobev (1963) shows that maximal sets 
$a^*$, $b^*$ exist such that $a \subseteq a^*$, $b \subseteq b^*$, $a^* \cup b^* \cup d = C$, and $a^*$ is separated 
from $b^*$ by $d$. Proposition 1 then gives us $p = p_{a^* \cup d}\, p_{b^* \cup d}/p_d$ and integration to 
obtain the marginal density of $X_{a \cup b \cup d}$ shows that (i) implies (iii). 

The implication in the reverse direction follows on noting that if $\{\alpha, \beta\} \notin E(C)$ 
then $\alpha$, $\beta$ are separated by $\{\alpha, \beta\}'$. Hence by (iii) $X_\alpha$ and $X_\beta$ are conditionally 
independent given $X_{\{\alpha, \beta\}'}$ and Corollary 1 shows that $K^{-1}(\alpha, \beta) = 0$. □ 


The results of Proposition 2 are illustrated in Example 2. 


EXAMPLE 2. Suppose $K^{-1}$ has the following pattern with * denoting a 
nonzero element: 


[Pattern of zeros and nonzero elements (*) of $K^{-1}$, with rows and columns indexed by 1, ..., 5; the display is not legible in this copy.] 


Then the corresponding graph $C$ would be as shown in Figure 3. If we put 
$\gamma = \{2\}$, $\partial\gamma = \{1, 3, 5\}$, and use the local Markov property we deduce that $X_2$ 
and $X_4$ are conditionally independent given $X_{\{1, 3, 5\}}$. Similarly with $a = \{1\}$, 
$b = \{4\}$, and $d = \{2\}$, the global Markov property can be used to assert that $X_1$ 
and $X_4$ are conditionally independent given $X_{\{2\}}$. 


3. Gaussian Markov distributions with prescribed marginals. In this 
section we consider the problem of finding a Gaussian probability measure with 
prescribed marginals, i.e., we seek a joint probability density $p$ whose marginals 


(3) $p_{c_1}, \ldots, p_{c_m}$ 


are known beforehand, $c_1, \ldots, c_m$ being proper subsets of $C$. (The notation is 
explained after (1) above.) Clearly if our marginal specifications are consistent it 
is necessary to give only the maximal $c_i$ in (3). 


142 T. P. SPEED AND H. T. KIIVERI 


[FIG. 3: the graph corresponding to the pattern of zeros in Example 2.] 


As motivation for this problem consider the following. Suppose we have $n$ 
independent and identically distributed observations $x_1, \ldots, x_n$ from (1) and we 
wish to find a maximum likelihood estimate of $K$ subject to certain elements of 
$K^{-1}$ being zero. When written in our notation, the likelihood equations for such 
a model (Dempster, 1972) are: 


(4) $K(\alpha, \beta) = S(\alpha, \beta)$ if $\{\alpha, \beta\} \in E(C)$ or $\alpha = \beta$, 
    $K^{-1}(\alpha, \beta) = 0$ if $\{\alpha, \beta\} \notin E(C)$ and $\alpha \neq \beta$, 


where $nS = \sum_{i=1}^{n} x_i x_i'$. The first equation in (4) is easily shown to be equivalent to 

(4') $K_c = S_c$ if $c \in \mathscr{C}(C)$, 


where $\mathscr{C}(C)$ is the class of cliques of $C$. Since a Gaussian distribution with mean 
zero is completely specified by its covariance matrix, (4') amounts to specifying 
the marginal distributions $p_c$ for $c \in \mathscr{C}(C)$. 

Theorem 1 can be used to describe the class of Gaussian measures with 
prescribed margins. 


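THEOREM 1. Given positive definite matrices $L$ and $M$ defined on the vertices 
$C$ of a graph $C = (C, E(C))$ there exists a unique positive definite matrix $K$ such 
that 


(i) $K(\alpha, \beta) = L(\alpha, \beta)$ if $\{\alpha, \beta\} \in E(C)$ or $\alpha = \beta$, 
(ii) $K^{-1}(\alpha, \beta) = M(\alpha, \beta)$ if $\{\alpha, \beta\} \notin E(C)$ and $\alpha \neq \beta$. 


Equivalently 
(i') $K_c = L_c$ if $c \in \mathscr{C}(C)$; 
(ii') $K^{-1}(\bar c, \bar c)$ and $M(\bar c, \bar c)$ agree except on the diagonals, $\bar c \in \mathscr{C}(\bar C)$. 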


PROOF. The equivalence of (i) and (i') follows from the relation 

(5) $E(C) = \bigcup_{c \in \mathscr{C}(C)} \ \bigcup_{\{\alpha, \beta\} \subseteq c} \{\alpha, \beta\}.$ 

Replacing $C$ by $\bar C$ in (5) enables the equivalence of (ii) and (ii') to be demonstrated. 

The main result of Theorem 1 can be established using the theory of exponen- 
tial families [Barndorff-Nielsen (1978), Johansen (1979)] and such a proof is 
sketched by Dempster (1972, Appendixes A and B). 

The results in Section 4 will show how to generate a sequence of matrices 
converging to the K of Theorem 1 and thus provide an alternative proof. We 
prefer this proof as it provides a basis for simple numerical algorithms which do 
not require Newton-Raphson type iterations or storage of large matrices to 
compute $K$. □ 


Replacing the $L$ in Theorem 1 by the sample covariance matrix and setting 
$M = I$ shows that the estimation problem for covariance selection models has a 
well defined solution. When $M = I$, the $K$ in Theorem 1 gives the Gaussian 
distribution with maximum entropy satisfying (i) or (i') [see Dempster (1972)]. 

Note that varying the $M$ in Theorem 1 gives the family of distributions with 
margins prescribed by $L_c$, $c \in \mathscr{C}(C)$. 

In the next section we will make use of the notion of the I-divergence of two 
positive definite matrices. This is defined by 


(6) $\mathscr{I}(P|R) = -\tfrac{1}{2}\{\log\det(PR^{-1}) + \mathrm{tr}(I - PR^{-1})\}.$ 


The definition (6) results from evaluating the discrimination information measure 
of Kullback (1959), namely $\int p(x) \log\{p(x)/r(x)\}\, dx$, for the two Gaussian distributions 
with densities $p(x)$, $r(x)$ defined by covariance matrices $P$, $R$. When it 
exists, the I-divergence behaves somewhat like a norm on a space of probability 
measures (Csiszár, 1975), although it is not one. 
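
A direct transcription of (6) may help fix ideas. This is a minimal sketch assuming NumPy; the function name `i_divergence` and the two $2 \times 2$ matrices are hypothetical choices of ours, not taken from the paper. 

```python
import numpy as np

def i_divergence(P, R):
    """I(P|R) = -1/2 {log det(P R^{-1}) + tr(I - P R^{-1})}, as in (6)."""
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    M = P @ np.linalg.inv(R)
    # slogdet returns (sign, log|det|); det(P R^{-1}) > 0 for positive definite P, R.
    _, logdet = np.linalg.slogdet(M)
    return -0.5 * (logdet + np.trace(np.eye(P.shape[0]) - M))

# Two hypothetical positive definite matrices.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
R = np.array([[1.0, 0.2], [0.2, 3.0]])

d_pr = i_divergence(P, R)
d_rp = i_divergence(R, P)
```

For these matrices `d_pr` and `d_rp` are both positive but unequal, so the I-divergence is not symmetric in its arguments. One can also check the identity $\mathscr{I}(P^{-1}|R^{-1}) = \mathscr{I}(R|P)$, which follows from (6) because $P^{-1}R$ and $RP^{-1}$ have the same determinant and trace. 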

Some properties of (6) which we will use later are given in Lemma 1. We write 
$\mathscr{P}$ for the set of $|C| \times |C|$ positive definite matrices and regard this as a (convex) 
subset of $R^q$ where $q = |C|^2$. In the following a set of unordered pairs of (not 
necessarily distinct) elements of $C$ will be denoted by $E$. 


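LEMMA 1. The I-divergence $\mathscr{I}(\cdot|\cdot)$ has the following properties. 
(i) If $P, R \in \mathscr{P}$, $\mathscr{I}(P|R) \geq 0$ with equality iff $P = R$. 
(ii) Given $P, R \in \mathscr{P}$, if there exists a $Q \in \mathscr{P}$ such that 
(a) $Q(\alpha, \beta) = P(\alpha, \beta)$ if $(\alpha, \beta) \in E$, and 
(b) $Q^{-1}(\alpha, \beta) = R^{-1}(\alpha, \beta)$ if $(\alpha, \beta) \notin E$, then 
(7) $\mathscr{I}(P|R) = \mathscr{I}(P|Q) + \mathscr{I}(Q|R).$ 
If such a $Q$ exists it is unique. 
(iii) If $\{K_n\}$ and $\{L_n\}$ are sequences contained in compact subsets of $\mathscr{P}$ then 
$\mathscr{I}(K_n|L_n) \to 0$ implies $K_n - L_n \to 0$. 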


PROOF. The first assertion is a well known property of the Kullback information 
measure so we focus on (ii) and (iii). 

(ii) A simple calculation shows that for $Q \in \mathscr{P}$ 

(8) $\mathscr{I}(P|Q) + \mathscr{I}(Q|R) = \mathscr{I}(P|R) - \tfrac{1}{2}\, \mathrm{tr}\{(Q - P)\Delta\},$ 

where $\Delta = Q^{-1} - R^{-1}$. Conditions (a) and (b) then ensure that the trace term in 
(8) is zero, since $Q - P$ vanishes on $E$ and $\Delta$ vanishes off $E$. 

To prove uniqueness suppose $Q_1$ and $Q_2$ satisfy (a) and (b) of (ii). Then setting 
$P = R = Q_1$ and $Q = Q_2$ shows that 

$\mathscr{I}(Q_1|Q_1) = \mathscr{I}(Q_1|Q_2) + \mathscr{I}(Q_2|Q_1)$ 

and since I-divergences are positive unless both arguments are equal we must 
have $Q_1 = Q_2$. 

(iii) Suppose $\mathscr{I}(K_n|L_n) \to 0$ but $K_n - L_n \not\to 0$. Then there exist convergent 
subsequences $K_{n'} \to K$ and $L_{n'} \to L$ with $K \neq L$. By continuity $\mathscr{I}(K_{n'}|L_{n'}) \to 
\mathscr{I}(K|L) \neq 0$, which is a contradiction. □ 


4. Algorithms. This section develops two algorithms for constructing the $K$ 
of Theorem 1. The first algorithm preserves (i') of Theorem 1 throughout the 
iterations and cycles through $\bar c \in \mathscr{C}(\bar C)$ forcing the off-diagonal elements of 
$K^{-1}(\bar c, \bar c)$ to zero. The second algorithm preserves (ii') whilst forcing $K_c = L_c$ as 
it cycles through $c \in \mathscr{C}(C)$. Both of these algorithms are special cases of a more 
general cyclic algorithm and we begin by presenting this algorithm. Throughout 
the discussion $E_1, E_2, \ldots, E_m$ denote sets of unordered pairs of (not necessarily 
distinct) elements of $C$ whose union is denoted by $E$. 


4.1. A general cyclic algorithm. The general cyclic algorithm is designed to 
solve the following problem. Given $G, H \in \mathscr{P}$ find an $F \in \mathscr{P}$ with the property 
that 

(9) $F(\alpha, \beta) = G(\alpha, \beta)$ if $(\alpha, \beta) \in E$, 

(10) $F^{-1}(\alpha, \beta) = H(\alpha, \beta)$ if $(\alpha, \beta) \notin E$. 

The algorithm is defined as follows. Generate a sequence $\{F_n\}$ of positive definite 
matrices satisfying $F_0 = H^{-1}$ and, for $n \geq 1$, 

(9') $F_n(\alpha, \beta) = G(\alpha, \beta)$ if $(\alpha, \beta) \in E_{n'}$, 

(10') $F_n^{-1}(\alpha, \beta) = F_{n-1}^{-1}(\alpha, \beta)$ if $(\alpha, \beta) \notin E_{n'}$, 

where $n' = n \ (\mathrm{mod}\ m)$. Basically the idea is to maintain (10) throughout the 
sequence whilst cycling through the $E_{n'}$ and forcing (9). The crucial step in the 
algorithm involves going from $F_{n-1}$ to $F_n$. Assuming for the moment that this 
step can be performed, a convergence proof for this algorithm, modelled upon 
that found in Csiszár (1975, Theorem 3.2), is given in Proposition 3. The two 
algorithms to be discussed are examples for which the sequence $\{F_n\}$ can be easily 
constructed. We write $N$ for the set of nonnegative integers. 


PROPOSITION 3. The sequence $\{F_n\}$ generated by the general cyclic algorithm 
converges to the unique $F \in \mathscr{P}$ with the properties (9) and (10). 
PROOF. By (ii) of Lemma 1 we can write for $r \geq 1$ 

(11) $\mathscr{I}(G|F_{r-1}) = \mathscr{I}(G|F_r) + \mathscr{I}(F_r|F_{r-1}).$ 

Summing relations of the form (11) over $r$ gives for $u \geq 1$ 

(12) $\mathscr{I}(G|F_0) = \mathscr{I}(G|F_u) + \sum_{r=1}^{u} \mathscr{I}(F_r|F_{r-1})$ 

and from (12) we deduce that 

(13) $\{F_n\} \subseteq \{F: \mathscr{I}(G|F) \leq \mathscr{I}(G|F_0)\} = \mathscr{A}$ (say). 

The set $\mathscr{A}$ is compact since $\mathscr{I}(G|F)$ is strictly convex (as a function of $F^{-1}$) with 
a unique minimum. From (12) it also follows that 

(14) $\sum_{r=1}^{\infty} \mathscr{I}(F_r|F_{r-1}) \leq \mathscr{I}(G|F_0).$ 

Hence $\sum_{r=1}^{\infty} \mathscr{I}(F_r|F_{r-1})$ is convergent and $\mathscr{I}(F_r|F_{r-1}) \to 0$ as $r \to \infty$. 
Now by (13) the vector sequence $\{(F_{sm+1}, F_{sm+2}, \ldots, F_{sm+m}): s \geq 0\}$ has a 
convergent subsequence, defined by $s \in N_1 \subset N$, with limit $(F_1^*, F_2^*, \ldots, F_m^*)$ 
say. For any $2 \leq t \leq m$ we can write 

(15) $(F_1^* - F_t^*) = (F_1^* - F_{sm+1}) + (F_{sm+1} - F_{sm+t}) + (F_{sm+t} - F_t^*).$ 

Letting $s \in N_1 \to \infty$ and using (iii) of Lemma 1 with $L_n = K_{n-1}$ shows that 
$F_1^* = F_2^* = \cdots = F_m^* = F$ (say). Note that (10) holds for each $F_n$ and hence for the 
limit $F$. Similarly for each $s \in N_1$ and $t$, $F_{sm+t}(\alpha, \beta) = G(\alpha, \beta)$ if $(\alpha, \beta) \in E_t$ so 
the same property holds for the limit $F$, i.e., (9) holds. 

A similar argument for any other convergent subsequence shows that the limit 
point satisfies (9) and (10) of our proposition. Lemma 1, part (ii) then establishes 
that all convergent subsequences have the same limit and hence $\{F_n\}$ converges. □ 


The next lemma enables sequences $\{F_n\}$ satisfying (9') and (10') to be constructed 
when either 

(16) $E_n = \{(\alpha, \beta): \alpha, \beta \in a_n \subseteq C\}$ 

or 

(17) $E_n = \{(\alpha, \beta): \alpha, \beta \in a_n \subseteq C, \alpha \neq \beta\}.$ 


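LEMMA 2. Suppose $Q$, $R$, and $B \in \mathscr{P}$. Then 
(i) for $a \subset C$ the matrix 

(18) $Q^{-1} = R^{-1} + \begin{pmatrix} B_a^{-1} - R_a^{-1} & 0 \\ 0 & 0 \end{pmatrix}$ 

is positive definite and satisfies 
(a) $Q(\alpha, \beta) = B(\alpha, \beta)$ if $\alpha \in a$ and $\beta \in a$; and 
(b) $Q^{-1}(\alpha, \beta) = R^{-1}(\alpha, \beta)$ if $\alpha \notin a$ or $\beta \notin a$. 

(ii) The matrix $Q$ is given by 

(19) $Q = \begin{pmatrix} B_a & B_a R_a^{-1} R_{a,a'} \\ R_{a',a} R_a^{-1} B_a & R_{a'} - R_{a',a} R_a^{-1} (I - B_a R_a^{-1}) R_{a,a'} \end{pmatrix}.$ 

(iii) We have the expression: 

(20) $\mathscr{I}(Q|R) = -\tfrac{1}{2}\{\log\det B_a R_a^{-1} + \mathrm{tr}(I_a - B_a R_a^{-1})\}.$ 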

PROOF. (i) We use the density scaling of Kullback (1968). In the Gaussian 
case, given densities $b(x)$ and $r(x)$ corresponding to positive definite matrices $B$ 
and $R$, scaling so that $r_a(x_a)$ agrees with $b_a(x_a)$ corresponds to computing 

(21) $q(x) = \dfrac{r(x)\, b_a(x_a)}{r_a(x_a)}.$ 

Expanding the right-hand side of (21) gives 

(22) $q(x) = \left\{(2\pi)^{|C|} \det R \, \dfrac{\det B_a}{\det R_a}\right\}^{-1/2} \exp\left\{-\tfrac{1}{2} x' \left(R^{-1} + \begin{pmatrix} B_a^{-1} - R_a^{-1} & 0 \\ 0 & 0 \end{pmatrix}\right) x\right\},$ 

which by (18) is just 

(23) $(2\pi)^{-|C|/2} (\det Q)^{-1/2} \exp\{-\tfrac{1}{2} x' Q^{-1} x\}.$ 

The properties (a) and (b) are now immediate. A direct proof using matrix algebra 
can also be given. 
The proofs of (ii) and (iii) are straightforward so we omit them. □ 
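
The three parts of Lemma 2 can be verified numerically. The sketch below assumes NumPy; the function names `scale_to_margin` and `i_divergence` and the test matrices are hypothetical choices of ours. It builds $Q$ from (19), then checks property (a), the inverse update (18), and the expression (20). 

```python
import numpy as np

def i_divergence(P, R):
    """I-divergence (6) of two positive definite matrices."""
    M = P @ np.linalg.inv(R)
    return -0.5 * (np.linalg.slogdet(M)[1] + np.trace(np.eye(len(P)) - M))

def scale_to_margin(R, B_a, a):
    """Matrix Q of (19): Q_a = B_a, and Q^{-1} agrees with R^{-1} outside the a-block."""
    R = np.asarray(R, dtype=float)
    a = list(a)
    rest = [i for i in range(R.shape[0]) if i not in a]   # the complement a'
    Ra_inv = np.linalg.inv(R[np.ix_(a, a)])
    R_ar = R[np.ix_(a, rest)]                             # R_{a,a'}
    Q = R.copy()
    Q[np.ix_(a, a)] = B_a
    Q_ar = B_a @ Ra_inv @ R_ar                            # B_a R_a^{-1} R_{a,a'}
    Q[np.ix_(a, rest)] = Q_ar
    Q[np.ix_(rest, a)] = Q_ar.T
    Q[np.ix_(rest, rest)] = (R[np.ix_(rest, rest)]
                             - R_ar.T @ Ra_inv @ (np.eye(len(a)) - B_a @ Ra_inv) @ R_ar)
    return Q

# Hypothetical positive definite R and target margin B_a on a = {0, 1}.
R = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.5, 0.4],
              [0.3, 0.4, 1.0]])
B_a = np.array([[1.0, 0.2],
                [0.2, 2.0]])
a = [0, 1]

Q = scale_to_margin(R, B_a, a)
R_a = R[np.ix_(a, a)]
# Right-hand side of (18) minus R^{-1}: nonzero only in the a-block.
expected_shift = np.zeros_like(R)
expected_shift[np.ix_(a, a)] = np.linalg.inv(B_a) - np.linalg.inv(R_a)
actual_shift = np.linalg.inv(Q) - np.linalg.inv(R)
# Right-hand side of (20) is the I-divergence of the two a-margins.
rhs_20 = i_divergence(B_a, R_a)
```

In this sketch `actual_shift` agrees with `expected_shift` to rounding error, so (18) and (19) describe the same matrix, and `i_divergence(Q, R)` agrees with `rhs_20`. 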


The two algorithms discussed below correspond to choosing the $a_n$ in (16) and 
(17) to be the cliques of $C$ or $\bar C$, respectively. In the following we will abbreviate 
the class of cliques of $C$ by $\mathscr{C}$ and the class of cliques of $\bar C$ by $\bar{\mathscr{C}}$. The notation 
diag($A$) refers to a diagonal matrix whose diagonals are the same as those of $A$. 


4.2. The first cyclic algorithm. List the cliques of the complementary graph $\bar C$ 
as $\bar c_1, \ldots, \bar c_m$ and generate a sequence $\{K_n\}$ as follows: $K_0 = L$; for $s \in N$, 
$1 \leq t \leq m$, $K_{sm+t} = Z_t(K_{sm+t-1})$ where $Z_t(K) = Q^{-1}$, $Q$ being the matrix (18) 
of Lemma 2 with $R = K^{-1}$, $a = \bar c_t$, and $B_a = \{\mathrm{diag}(((K^{-1})_a)^{-1})\}^{-1}$. The fact that 
this sequence converges to the required matrix $K$ when $M = I$ follows from 
Proposition 3 on replacing $a_n$ in (17) by $\bar c_n$ and making the identifications 
$F_n = K_n^{-1}$, $G = M$, and $H = L$. It does not seem possible to give an explicit 
expression for $B_a$ in the case when $M \neq I$. 

For this algorithm the elements of the sequence $\{K_n\}$ are fixed over $\mathscr{C}$ whilst 
the elements of $\{K_n^{-1}\}$ vary over $\bar{\mathscr{C}}$. From a computational point of view it is not 
necessary to compute the sequence $\{K_n\}$ by inverting $K_n^{-1}$ at each step. The 
expression (18) provides a simple updating formula for $K_n$, given $K_{n-1}$. Hence it 
is only necessary to invert $|\bar c| \times |\bar c|$ positive definite matrices when cycling 
through $\bar c \in \bar{\mathscr{C}}$. 
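
As an illustration, the first cyclic algorithm for $M = I$ might be sketched as follows. This assumes NumPy, a hypothetical matrix $L$ on a 4-cycle, and a function name of our own; for brevity it inverts the full matrix at every step, which the remark above notes is not necessary in a careful implementation. 

```python
import numpy as np

def first_cyclic_algorithm(L, complement_cliques, sweeps=200):
    """Sketch for M = I: keep K = L on edges and diagonal, drive K^{-1} to zero on non-edges."""
    K = np.array(L, dtype=float)                      # K_0 = L
    for _ in range(sweeps):
        for a in complement_cliques:
            idx = np.ix_(list(a), list(a))
            # S = {(K^{-1})_a}^{-1}; a full inverse is used here only for brevity.
            S = np.linalg.inv(np.linalg.inv(K)[idx])
            # Update (18) with B_a^{-1} = diag(S): only the off-diagonal
            # entries of K inside the block a are changed.
            K[idx] += np.diag(np.diag(S)) - S
    return K

# Hypothetical 4-cycle 0-1-2-3-0; its complementary graph has edges {0,2} and {1,3},
# which are also its cliques.
L = np.array([[2.0, 0.6, 0.3, 0.5],
              [0.6, 2.0, 0.7, 0.2],
              [0.3, 0.7, 2.0, 0.4],
              [0.5, 0.2, 0.4, 2.0]])
K = first_cyclic_algorithm(L, [(0, 2), (1, 3)])
K_inv = np.linalg.inv(K)
```

Each update leaves the diagonal of $K$ and all entries outside the current block untouched, so condition (i) of Theorem 1 holds at every step, while the entries of `K_inv` on the non-edges approach zero as the sweeps proceed. 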

The cyclic algorithm of Wermuth and Scheidt (1977) is also a special case of 
the general algorithm. Instead of using the cliques of $\bar C$ these authors cycle 
through the edges $\{\alpha, \beta\} \in E(\bar C)$. The $2 \times 2$ matrix inversions required are 
explicitly performed and used to give a simple updating formula. Their algorithm 
is defined in the same way as above but they have $a \in E(\bar C)$ and 


$B_a = \begin{pmatrix} \delta w^{-1} & 0 \\ 0 & \delta u^{-1} \end{pmatrix},$ 


where 


$(K^{-1})_a = \begin{pmatrix} u & v \\ v & w \end{pmatrix}$ 


and $\delta = uw - v^2$. It is easily seen that at each step the current value of $K(\alpha, \beta)$ 
is changed by subtracting $-v/\delta$, the off-diagonal element of $\{(K^{-1})_a\}^{-1}$, so that 
$K^{-1}(\alpha, \beta) = 0$. A computer program for performing 
the adjustments is given in Wermuth and Scheidt’s paper. 


4.3. The second cyclic algorithm. Enumerate the cliques of $C$ as $c_1, c_2, \ldots, c_m$ 
and define a sequence $\{K_n\}$ as follows: $K_0 = M^{-1}$; for $s \geq 0$, $1 \leq t \leq m$, 
$K_{sm+t} = Y_t(K_{sm+t-1})$, where $Y_t(K) = Q$, $Q$ being the matrix (19) of Lemma 2 
with $R = K$, $a = c_t$, and $B = L$. Making the identifications $a_n = c_n$ in (16) and 
$F_n = K_n$, $G = L$, and $H = M$ in Proposition 3 shows that the second algorithm 
converges to the $K$ of Theorem 1. This result also gives an alternative proof of 
Theorem 1. Note that $\{K_n^{-1}\}$ is held fixed over $\bar{\mathscr{C}}$ whilst $\{K_n\}$ varies over $\mathscr{C}$. 
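
A minimal sketch of the second cyclic algorithm, assuming NumPy, $M = I$, and the same hypothetical 4-cycle example as might be used for the first algorithm (function names are ours). Each step applies the scaling of Lemma 2 through the inverse update (18) rather than the explicit form (19); the two are equivalent. 

```python
import numpy as np

def scale_to_clique(K, L, c):
    """One step Y_t: force K_c = L_c while leaving K^{-1} unchanged outside the block c."""
    idx = np.ix_(list(c), list(c))
    precision = np.linalg.inv(K)
    # Equation (18) with R = K and B_c = L_c.
    precision[idx] += np.linalg.inv(L[idx]) - np.linalg.inv(K[idx])
    return np.linalg.inv(precision)

def second_cyclic_algorithm(L, cliques, sweeps=200):
    """Sketch for M = I: start from K_0 = I and cycle through the cliques of the graph."""
    K = np.eye(L.shape[0])
    for _ in range(sweeps):
        for c in cliques:
            K = scale_to_clique(K, L, c)
    return K

# Hypothetical 4-cycle 0-1-2-3-0: the cliques are its four edges.
L = np.array([[2.0, 0.6, 0.3, 0.5],
              [0.6, 2.0, 0.7, 0.2],
              [0.3, 0.7, 2.0, 0.4],
              [0.5, 0.2, 0.4, 2.0]])
cliques = [(0, 1), (1, 2), (2, 3), (0, 3)]
K = second_cyclic_algorithm(L, cliques)
K_inv = np.linalg.inv(K)
```

Because the inverse is only ever modified inside clique blocks, its entries on the non-edges (0, 2) and (1, 3) stay at their starting value of zero throughout, and the iterations serve to bring every clique margin of $K$ into agreement with $L$. 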

That this second algorithm is analogous to iterative proportional scaling for 
contingency tables should be clear. At each step we “scale” the current covariance 
matrix to match the relevant “margin” $L_c$. We can also connect this 
algorithm with a general procedure in Kullback (1968) where, however, the proofs 
are incomplete. Using our notation, Kullback’s procedure can be described as 
follows. Given the required marginal densities $g_{c_1}, \ldots, g_{c_m}$ and an initial density 
$\pi(x)$ construct the sequence $\{f_n\}$ (assumed to exist) defined by 


$f_0(x) = \pi(x),$ 

and for $s \geq 0$, $1 \leq t \leq m$, 

$f_{sm+t}(x) = \dfrac{f_{sm+t-1}(x)\, g_{c_t}(x_{c_t})}{(f_{sm+t-1})_{c_t}(x_{c_t})}.$ 


Note that this simply amounts to scaling the previous density to ensure the 
desired marginals and this is how we obtain the matrix $Q$ of Lemma 2. Hence the 
second cyclic algorithm is a Gaussian version of Kullback’s general procedure. It 
can also be shown to be a cyclic ascent algorithm. 

4.4. Finite termination. When the graph C is triangulated and M = I the 
second cyclic algorithm converges after one cycle if the cliques are suitably 
ordered. This result is completely analogous to the one cycle convergence of 
iterative proportional scaling for contingency tables when the generating class is 
decomposable [see Haberman (1974, Chapter 5)]. 

To demonstrate the result we need the following two lemmas. Without loss of 
generality we assume that the graph C is connected. 


LEMMA 3. If $C$ is triangulated then there exists an enumeration $c_1, \ldots, c_m$ of 
the cliques such that for $i = 2, \ldots, m$ 

(24) $c_i \setminus \bigcup_{l=1}^{i-1} c_l \neq \emptyset.$ 

PROOF. The result is obtained by successively removing detachable cliques 
from $C$ [see Lauritzen et al. (1984)]. □ 


Note that (24) states that for each $i$ the clique $c_i$ contains a vertex not in $c_l$ for 
$l = 1, \ldots, i - 1$. 

The second lemma gives an expression for the determinant of the matrix K in 
Proposition 1 which is useful in proving the finite termination of the second 
algorithm. 


LEMMA 4. Suppose $K \in \mathscr{P}$ and $(K^{-1})_{a \setminus b,\, b \setminus a} = 0$ for $a$, $b$ with $a \cup b = C$. Then 

(25) $\det K = (\det K_a)(\det K_b)/\det K_{ab}.$ 


PROOF. Note that (iii) of Proposition 1 implies $p = p_a p_b/p_{ab}$. Evaluation at 
$x = 0$ then gives the result. □ 


PROPOSITION 4. If the cliques of $C$ are ordered as in Lemma 3 and we start 
the second cyclic algorithm with $K_0 = I$, then 


(i) $(K_m)_c = L_c$ for $c \in \mathscr{C}$; 
(ii) $(K_m^{-1})_{\bar c}$ is diagonal for $\bar c \in \bar{\mathscr{C}}$. 


PROOF. We will prove that $\mathscr{I}(K|K_m) = 0$ where $K$ is the unique matrix of 
Theorem 1 with $M = I$. This will follow directly from (12) provided we can show 
that 

(26) $\mathscr{I}(K|I) = \sum_{r=1}^{m} \mathscr{I}(K_r|K_{r-1})$ 

and we prove this by induction on $m$, the number of cliques. It is clearly true for 
$m = 1$ and so we assume that it is true for all $m \leq q$ where $q \geq 1$. If we can 
prove 

(27) $\mathscr{I}(K|I) = \mathscr{I}(K_{q+1}|K_q) + \mathscr{I}(K_{\hat c}|I),$ 

where $\hat c = \bigcup_{l=1}^{q} c_l$, then (26) will follow for $m = q + 1$; $q$ steps of the second 
algorithm starting from $K_0 = I$ generate matrices having the form 

$K_t = \begin{pmatrix} I & 0 \\ 0 & \hat K_t \end{pmatrix}, \qquad t = 1, \ldots, q,$ 

where $\hat K_t$ is $|\hat c| \times |\hat c|$ and from the inductive hypothesis 

$\mathscr{I}(K_{\hat c}|I) = \sum_{r=1}^{q} \mathscr{I}(\hat K_r|\hat K_{r-1}) = \sum_{r=1}^{q} \mathscr{I}(K_r|K_{r-1}).$ 

Turning now to the proof of (27) we remark that it follows from Lemma 4 with 
$a = c_{q+1}$ and $b = \hat c$, the relationship (20) with $Q = K_{q+1}$, $R = K_q$ and $a = c_{q+1}$ 
as before, and the fact that 

(Edel 2b 


The log det terms in the definition of $\mathscr{I}$ match up by Lemma 4 and the trace 
terms correspond by (20) and the fact just noted. □ 
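
The one-cycle convergence can be seen on the smallest interesting triangulated graph. The sketch below assumes NumPy and a hypothetical $3 \times 3$ matrix $L$ on the path 0-1-2, whose cliques $\{0, 1\}$, $\{1, 2\}$ are already ordered as in Lemma 3; the helper `scale_to_clique` is our own name for one scaling step of the second algorithm. 

```python
import numpy as np

def scale_to_clique(K, L, c):
    """One scaling step: force K_c = L_c, leaving K^{-1} unchanged outside the block c."""
    idx = np.ix_(list(c), list(c))
    precision = np.linalg.inv(K)
    precision[idx] += np.linalg.inv(L[idx]) - np.linalg.inv(K[idx])
    return np.linalg.inv(precision)

# Hypothetical margins on the path graph 0-1-2 (triangulated, M = I).
L = np.array([[2.0, 0.6, 0.3],
              [0.6, 2.0, 0.7],
              [0.3, 0.7, 2.0]])
cliques = [(0, 1), (1, 2)]

K1 = np.eye(3)                      # K_0 = I
for c in cliques:                   # first cycle
    K1 = scale_to_clique(K1, L, c)

K2 = K1.copy()
for c in cliques:                   # second cycle, which should change nothing
    K2 = scale_to_clique(K2, L, c)

# Lemma 4 with a = {0, 1}, b = {1, 2}, ab = {1}.
det_lhs = np.linalg.det(K1)
det_rhs = np.linalg.det(K1[:2, :2]) * np.linalg.det(K1[1:, 1:]) / K1[1, 1]
```

After the first cycle both clique margins of `K1` already equal those of $L$, the (0, 2) element of its inverse is zero, and the only entry not prescribed by $L$ is $K(0, 2) = L(0, 1) L(1, 2)/L(1, 1)$. 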


We conclude this section with a few remarks comparing the two algorithms. 
When $M = I$, the main drawback of the first algorithm is the need to invert $L$ at 
the beginning. It is possible that a numerical inversion of $L$ could be difficult or 
impossible yet the second algorithm would work. This problem aside, it should be 
clear that the choice of which algorithm is to be favoured in any given situation is 
very much dependent on the number and sizes of the cliques in $\mathscr{C}$ and $\bar{\mathscr{C}}$. 
However, if $C$ is triangulated and $M = I$, the finite termination property of the 
second algorithm makes it attractive. 


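5. Some comments about the geometry. To give a geometric interpretation 
of the two algorithms it is convenient to define the “subspaces” $\mathscr{P}_{L,c} = 
\{P \in \mathscr{P}: P_c = L_c\}$, $\mathscr{Q}_{M,\bar c} = \{Q \in \mathscr{P}: (Q^{-1})_{\bar c}$ agrees with $M_{\bar c}$ except on the 
diagonal$\}$, and $\mathscr{P}_{L,\mathscr{C}} = \bigcap\{\mathscr{P}_{L,c}: c \in \mathscr{C}\}$, $\mathscr{Q}_{M,\bar{\mathscr{C}}} = \bigcap\{\mathscr{Q}_{M,\bar c}: \bar c \in \bar{\mathscr{C}}\}$. 

Equation (7) bears a resemblance to Pythagoras’ theorem and clearly for all 
$P \in \mathscr{P}_{L,c}$ we have $\mathscr{I}(P|R) \geq \mathscr{I}(Q|R)$ with equality iff $Q = P$. Hence one can 
call the matrix $Q$ the I-projection of $R$ on to $\mathscr{P}_{L,c}$ [see Csiszár (1975)]. 

Viewing the adjustment defined by $Q$ in Lemma 2 as an I-projection we can 
give an interpretation of the two cyclic algorithms as follows. 

The first algorithm begins with a $K_0 \in \mathscr{P}_{L,\mathscr{C}}$ and cycles through $\bar c \in \bar{\mathscr{C}}$, 
I-projecting the current estimate of $K$ onto $\mathscr{P}_{L,\mathscr{C}} \cap \mathscr{Q}_{M,\bar c}$ in order to obtain the 
required element in $\mathscr{P}_{L,\mathscr{C}} \cap \mathscr{Q}_{M,\bar{\mathscr{C}}}$. The fact that we are I-projecting follows from 
(ii) of Lemma 1. Using this, for all $K \in \mathscr{Q}_{M,\bar c}$ we have 


$\mathscr{I}(K^{-1}|R^{-1}) = \mathscr{I}(K^{-1}|Q) + \mathscr{I}(Q|R^{-1})$ 

or equivalently 

$\mathscr{I}(R|K) = \mathscr{I}(Q^{-1}|K) + \mathscr{I}(R|Q^{-1}),$ 

and so $\mathscr{I}(R|K) \geq \mathscr{I}(R|Q^{-1})$ for all $K \in \mathscr{Q}_{M,\bar c}$ with equality iff $K = Q^{-1}$. 

For the second algorithm we begin with $K_0 \in \mathscr{Q}_{M,\bar{\mathscr{C}}}$ and cycle through $c \in \mathscr{C}$, 
I-projecting the current estimate $K$ onto $\mathscr{Q}_{M,\bar{\mathscr{C}}} \cap \mathscr{P}_{L,c}$. 

Both of the above algorithms are analogous to computing the projection onto 
the intersection of nonorthogonal (linear) subspaces by successively projecting 
onto each subspace [see for example von Neumann (1950, Chapter 13)]. 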
SIMULATED POWER FUNCTIONS 


By RUDOLF BERAN 
University of California, Berkeley 


Tests for a null hypothesis whose specification involves an unknown 
nuisance parameter may be obtained by inverting a bootstrap confidence 
region for the parameter being tested or by constructing a simulated null 
distribution for the test statistic. The power of either test against certain 
alternatives involving the same unknown nuisance parameter can itself be 
estimated by simulation. 


1. Introduction. Bootstrap distribution estimates generate confidence regions 
of approximate level $1 - \alpha$ in a variety of statistical models, including some 
models for which alternative constructions of confidence regions encounter substantial 
technical difficulties [cf. Efron (1979), Bickel and Freedman (1981, 1982), 
Beran (1984), and Beran and Srivastava (1985)]. Underlying the bootstrap is the 
concept of simulation: the fitting of a mathematical model to observations on a 
system and the subsequent use of the fitted model to mimic, or simulate, the 
system. Simulation is a well-established technique in disciplines as diverse as 
numerical weather forecasting, the calculation of tide tables, and economic 
forecasting. 

Simulation ideas also have application in statistical hypothesis testing. Tests 
for a null hypothesis whose specification involves an unknown nuisance parame- 
ter may be obtained by inverting a bootstrap confidence region for the parameter 
being tested or by constructing a simulated null distribution for the test statistic. 
The power of either test against certain alternatives involving the same unknown 
nuisance parameter can itself be estimated by simulation. The uniform con- 
sistency of such simulated power functions is the main result of this paper. 

Consider the following general situation. Suppose $\Xi$ and $\Theta$ are metric spaces 
with metrics $m_1$, $m_2$, respectively. The observations $X_1, X_2, \ldots, X_n$ are independent 
identically distributed random vectors with joint distribution $P^n_{\xi_A, \theta_A}$, which 
belongs to a parametric family $\{P^n_{\xi, \theta}: (\xi, \theta) \in \Omega\}$. The parameter space $\Omega$ is a 
subset of $\Xi \times \Theta$. The subscript “A” in $(\xi_A, \theta_A)$ designates the “actual” parameter 
values which are supposed to underlie the experiment. Both $\xi_A$ and $\theta_A$ are 
unknown. Let $\xi_0$ be a specified element of $\Xi$. We wish to test the null hypothesis 
that $\xi_A = \xi_0$ against alternative hypotheses in which $\xi_A$ differs from $\xi_0$, with $\theta_A$ 
being viewed as a fixed nuisance parameter. 

More formally, the testing problem under consideration is 


(1.1) $H_{n,0}$: the $\{X_i; 1 \leq i \leq n\}$ have distribution $P^n_{\xi_0, \theta_A}$, $(\xi_0, \theta_A) \in \Omega$ 

versus alternatives of the form 

(1.2) $H_{n,\xi}$: the $\{X_i; 1 \leq i \leq n\}$ have distribution $P^n_{\xi, \theta_A}$, $(\xi, \theta_A) \in \Omega$, $\xi \neq \xi_0$, 


with $\theta_A$ being the unknown actual value of $\theta$ in the experiment. The alternatives 
(1.2) are of particular interest because they include the actual distribution $P^n_{\xi_A, \theta_A}$ 
of the data when the null hypothesis $H_{n,0}$ is false; and because the power 
function of a test over the alternatives (1.2) is typically estimable while power 
against $P^n_{\xi_A, \theta_A}$ is not. Tacit is the requirement that the model $\{P^n_{\xi, \theta}: (\xi, \theta) \in \Omega\}$ 
be general enough to contain, or reasonably approximate, the actual distribution 
of the data. 


The test statistic approach. Suppose $T_n = T_n(X_1, X_2, \ldots, X_n)$ is a test statistic 
for the null hypothesis $H_{n,0}$. Let $K_{n,T}(\xi, \theta)$ denote $\mathscr{L}[T_n | P^n_{\xi, \theta}]$, the distribution 
of $T_n$ under $P^n_{\xi, \theta}$, and let 


(1.3) $K_{n,T}(x; \xi, \theta) = P^n_{\xi, \theta}[T_n > x]$ 

be the corresponding survival function. For $\alpha \in (0, 1)$, let 

(1.4) $d_{n,L}(\alpha; \xi_0, \theta) = \inf\{x: K_{n,T}(x; \xi_0, \theta) \leq \alpha\},$ 
      $d_{n,U}(\alpha; \xi_0, \theta) = \sup\{x: K_{n,T}(x; \xi_0, \theta) \geq \alpha\}.$ 


Suppose $\hat\theta_n$ is a consistent estimate of $\theta_A$ and $d_n(\alpha; \xi_0, \hat\theta_n)$ is any random variable 
lying between $d_{n,L}(\alpha; \xi_0, \hat\theta_n)$ and $d_{n,U}(\alpha; \xi_0, \hat\theta_n)$. Define the test $\varphi_n$ by 


(1.5) $\varphi_n(X) = 1$ if $T_n > d_n(\alpha; \xi_0, \hat\theta_n)$, and $\varphi_n(X) = 0$ otherwise. 


The critical value $d_n(\alpha; \xi_0, \hat\theta_n)$ is an upper $\alpha$-point of $K_{n,T}(\xi_0, \hat\theta_n)$, the simulated 
null distribution of $T_n$. In practice, $d_n(\alpha; \xi_0, \hat\theta_n)$ can often be approximated by 
performing a Monte Carlo simulation of the distribution $K_{n,T}(\xi_0, \hat\theta_n)$. This 
calculation is an extension of the more familiar Monte Carlo technique for finding 
a critical value when testing a simple hypothesis. Under conditions to be 
described in Theorem 2.1, the test $\varphi_n$ has asymptotic level $\alpha$ under $H_{n,0}$. 

The power of $\varphi_n$ against the alternative $H_{n,\xi}$ is 


(1.6) $\beta_{n,\varphi}(\alpha; \xi, \theta_A) = P^n_{\xi, \theta_A}[T_n > d_n(\alpha; \xi_0, \hat\theta_n)].$ 


To estimate $\beta_{n,\varphi}(\alpha; \xi, \theta_A)$, we might seek an analytical asymptotic approximation 
to $\beta_{n,\varphi}(\alpha; \xi, \theta_A)$; and then replace $\theta_A$ by $\hat\theta_n$ wherever $\theta_A$ appears in this approximation. 
Unfortunately, the available asymptotic approximations to $\beta_{n,\varphi}(\alpha; \xi, \theta_A)$ 
are not always trustworthy for every value of $\xi$. Consider, for instance, local 
asymptotic power approximations when $\xi$ is an infinite dimensional parameter. 

The simulation estimate of the power $\beta_{n,\varphi}(\alpha; \xi, \theta_A)$ is $\beta_{n,\varphi}(\alpha; \xi, \hat\theta_n)$. Evaluation 
of $\beta_{n,\varphi}(\alpha; \xi, \hat\theta_n)$ typically requires Monte Carlo simulation of the distribution 
of $T_n - d_n(\alpha; \xi_0, \hat\theta_n)$, the simulation samples being drawn from $P^n_{\xi, \hat\theta_n}$. For each 
sample $X^* = (X_1^*, X_2^*, \ldots, X_n^*)$ drawn from $P^n_{\xi, \hat\theta_n}$, the critical value 
$d_n(\alpha; \xi_0, \hat\theta_n(X^*))$ must itself be recalculated, usually by nested Monte Carlo 
simulation of the distribution $K_{n,T}(\xi_0, \hat\theta_n(X^*))$. If $\xi$ is a euclidean parameter and 
other requirements to be specified in Theorem 2.2 are met, then 


(1.7) $\lim_{n \to \infty} \sup_{\xi} |\beta_{n,\varphi}(\alpha; \xi, \hat\theta_n) - \beta_{n,\varphi}(\alpha; \xi, \theta_A)| = 0$ 


with $P_{\xi_A, \theta_A}$-probability one. If $\xi$ is an infinite dimensional parameter, a uniform 
convergence result slightly weaker than (1.7) can still be proved (Theorem 2.3). 

Another estimate of $\beta_{n,\varphi}(\alpha; \xi, \theta_A)$, which relies on asymptotic constancy of the 
critical value $d_n(\alpha; \xi_0, \hat\theta_n)$, is $K_{n,T}[d_n(\alpha; \xi_0, \hat\theta_n); \xi, \hat\theta_n]$. This computationally 
simpler estimate also converges to $\beta_{n,\varphi}(\alpha; \xi, \theta_A)$, uniformly in $\xi$, under conditions 
to be described in Theorems 2.2 and 2.3. Evaluation of $K_{n,T}[d_n(\alpha; \xi_0, \hat\theta_n); \xi, \hat\theta_n]$ 
typically requires the initial calculation of the critical value $d_n(\alpha; \xi_0, \hat\theta_n)$ and 
Monte Carlo simulation of the distribution $K_{n,T}(\xi, \hat\theta_n)$. 
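
The computationally simpler estimate can be sketched for a one-sided test of a normal mean. Everything specific here is an assumption made for illustration: the $N(\xi, \theta)$ model, the one-sample t-statistic, the sample size, the number of Monte Carlo replications, the seed, and the function names. The nested recalculation of the critical value required by the first estimate is deliberately not attempted. 

```python
import numpy as np

def t_statistic(x, xi0):
    """One-sample t-statistic for the hypothesis mean = xi0 (last axis indexes observations)."""
    n = x.shape[-1]
    return np.sqrt(n) * (x.mean(axis=-1) - xi0) / x.std(axis=-1, ddof=1)

def simulated_critical_value(alpha, xi0, theta_hat, n, reps, rng):
    """Upper alpha-point of the simulated null distribution K_{n,T}(xi0, theta_hat)."""
    sims = rng.normal(xi0, np.sqrt(theta_hat), size=(reps, n))
    return np.quantile(t_statistic(sims, xi0), 1.0 - alpha)

def simple_power_estimate(d, xi, xi0, theta_hat, n, reps, rng):
    """Monte Carlo approximation to K_{n,T}[d; xi, theta_hat] = P[T_n > d]."""
    sims = rng.normal(xi, np.sqrt(theta_hat), size=(reps, n))
    return np.mean(t_statistic(sims, xi0) > d)

rng = np.random.default_rng(0)
n, alpha, xi0 = 20, 0.05, 0.0
data = rng.normal(0.3, 1.0, size=n)        # hypothetical observed sample
theta_hat = data.var(ddof=1)               # estimate of the nuisance variance

d = simulated_critical_value(alpha, xi0, theta_hat, n, 20000, rng)
level = simple_power_estimate(d, xi0, xi0, theta_hat, n, 20000, rng)
power = simple_power_estimate(d, xi0 + 0.5 * np.sqrt(theta_hat), xi0,
                              theta_hat, n, 20000, rng)
```

In this particular model the t-statistic is pivotal under the null hypothesis, so the simulated critical value is close to the usual Student quantile whatever the value of `theta_hat`; the example only illustrates the mechanics of the two Monte Carlo steps. 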

Section 2 of this paper describes a numerical study wherein the actual power 
function of the bootstrap t-test was compared with the two power function 
estimates described above and with the power function of the classical t-test. 

While the function $\beta_{n,\varphi}(\alpha; \cdot, \theta_A)$ is estimable, as indicated above, it does not 
seem possible to estimate $\beta_{n,\varphi}(\alpha; \xi_A, \theta_A)$ itself. Suppose $(\hat\xi_n, \hat\theta_n)$ are consistent 
estimates of $(\xi_A, \theta_A)$. In general, $\beta_{n,\varphi}(\alpha; \hat\xi_n, \hat\theta_n)$ does not converge to 
$\beta_{n,\varphi}(\alpha; \xi_A, \theta_A)$. For example, consider the following case: $P_{\xi, \theta}$ is the $N(\xi, \theta)$ 
distribution; $(\hat\xi_n, \hat\theta_n)$ are the usual estimates of mean and variance; $T_n$ is the 
t-statistic; $H_{n,0}$ is the hypothesis that $\xi_A = 0$. Small perturbations in $\xi_A$ affect 
the value of $\beta_{n,\varphi}(\alpha; \xi_A, \theta_A)$ far more than do small perturbations in the nuisance 
parameter $\theta_A$. 

Equation (1.7) immediately implies that $\beta_{n,\varphi}(\alpha; \xi_0, \hat\theta_n)$ is a consistent estimate
of the actual level of the test $\varphi_n$. Moreover, suppose we are interested in
the performance of $\varphi_n$ against alternatives $H_{n,\xi}$ indexed by $\xi \in \Xi_n$, where $\Xi_n =
\{\xi \in \Xi: m_1(\xi, \xi_0) = \varepsilon_n\}$ and $\{\varepsilon_n\}$ is a sequence of positive constants tending to
zero in such a way that $\inf\{\beta_{n,\varphi}(\alpha; \xi, \theta_A): \xi \in \Xi_n\}$ has a limit in (0, 1). (Under the
assumptions for Theorem 2.2, $\varepsilon_n = n^{-1/2}c$ will do.) Because of (1.7),
$\inf\{\beta_{n,\varphi}(\alpha; \xi, \theta_A): \xi \in \Xi_n\}$ is estimated consistently by $\inf\{\beta_{n,\varphi}(\alpha; \xi, \hat\theta_n): \xi \in \Xi_n\}$.
Thus, asymptotically correct comparisons between tests based on test statistics
$T_{n,1}$ and $T_{n,2}$ can be made by referring to the estimated level and the estimated
minimum power over $\Xi_n$ of each test.

Implied by this result is a technique for constructing adaptive tests based
upon a finite collection of tests $\{\varphi_{n,j}\}$ of asymptotic level $\alpha$: use the
test $\varphi_{n,j}$ for which $\inf\{\beta_{n,\varphi_j}(\alpha; \xi, \hat\theta_n): \xi \in \Xi_n\}$ is greatest. In view of the
preceding paragraph, this procedure defines a test $\varphi_n^*$ which has asymptotic level $\alpha$ and
the property that $\inf\{\beta_{n,\varphi^*}(\alpha; \xi, \theta_A): \xi \in \Xi_n\}$ converges to
$\max_j \inf\{\beta_{n,\varphi_j}(\alpha; \xi, \theta_A): \xi \in \Xi_n\}$.
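
The selection rule can be sketched in a few lines. The sketch assumes that each candidate test comes with an estimated power function (for instance, one of the simulated estimates above) and that the set of alternatives has been replaced by a finite grid; the names and the toy power curves below are hypothetical.

```python
def min_power(power_fn, alternatives):
    """Smallest estimated power over the grid of alternatives."""
    return min(power_fn(xi) for xi in alternatives)


def choose_adaptive_test(power_estimates, alternatives):
    """Return the name of the test whose estimated minimum power is greatest.

    power_estimates maps a test name to a callable xi -> estimated power.
    """
    return max(power_estimates,
               key=lambda name: min_power(power_estimates[name], alternatives))


# Toy illustration with made-up power curves (not results from the paper).
alternatives = [-0.5, 0.5]
power_estimates = {
    "two_sided": lambda xi: 0.55,
    "one_sided": lambda xi: 0.70 if xi > 0 else 0.01,
}
chosen = choose_adaptive_test(power_estimates, alternatives)
```

In the toy illustration the one-sided test has higher power on one side but almost none on the other, so the maximin rule selects the two-sided test.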

A related problem is the estimation of test power in future experiments on the
basis of current information. Suppose the $\{X_i; 1 \le i \le n\}$ are observations to be
taken in a future experiment. Let $\hat\theta_m$ be a consistent estimate of $\theta_A$ based on an
independent training sample of size m. Under conditions similar to those for
Theorems 2.2 and 2.3, $\beta_{n,\varphi}(\alpha; \xi, \hat\theta_m)$ is a uniformly consistent estimate of
$\beta_{n,\varphi}(\alpha; \xi, \theta_A)$ as min(m, n) tends to infinity. The performance of the test $\varphi_n$ in
the proposed experiment can therefore be assessed by examining the simulated
power function $\beta_{n,\varphi}(\alpha; \xi, \hat\theta_m)$, as in the previous paragraphs.


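The confidence region approach. Suppose $R_n(X, \xi)$ is a pivot for $\xi$, a random
function depending on $X = (X_1, X_2, \ldots, X_n)$ and on the parameter $\xi$. Let
$J_{n,R}(\xi, \theta)$ denote $\mathcal{L}[R_n(X, \xi) \mid P^n_{\xi,\theta}]$ and let $J_{n,R}(x; \xi, \theta)$ be the corresponding
survival function, defined as in (1.3). For $\alpha \in (0, 1)$, let


(1.8)    $c_{n,L}(\alpha; \xi, \theta) = \inf\{x: J_{n,R}(x; \xi, \theta) \le \alpha\}$,
         $c_{n,U}(\alpha; \xi, \theta) = \sup\{x: J_{n,R}(x; \xi, \theta) \ge \alpha\}$.


Suppose $(\hat\xi_n, \hat\theta_n)$ is a consistent estimate of $(\xi_A, \theta_A)$ and $c_n(\alpha; \hat\xi_n, \hat\theta_n)$ is any
random variable lying between $c_{n,L}(\alpha; \hat\xi_n, \hat\theta_n)$ and $c_{n,U}(\alpha; \hat\xi_n, \hat\theta_n)$. The set
$\{\xi: R_n(X, \xi) \le c_n(\alpha; \hat\xi_n, \hat\theta_n)\}$ is a bootstrap confidence region for $\xi$, of ostensible
level $1 - \alpha$. Under circumstances to be described in Theorem 3.1, the
corresponding test


(1.9)    $\psi_n(X) = 1$ if $R_n(X, \xi_0) > c_n(\alpha; \hat\xi_n, \hat\theta_n)$, and $\psi_n(X) = 0$ otherwise,


has asymptotic level $\alpha$ under $H_{n,0}$. The tests $\varphi_n$ and $\psi_n$ are related when
$T_n(X) = R_n(X, \xi_0)$, but even then will usually have different critical values and
therefore different power functions.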
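
When the pivot distribution is approximated by B simulated draws, the two critical values in (1.8) reduce to order statistics. The sketch below assumes the survival function is $P(R > x)$ (the convention of (1.3) is not restated here) and uses hypothetical function names.

```python
import math


def simulated_critical_values(pivot_draws, alpha):
    """Lower and upper critical values of (1.8) for the empirical survival
    function J(x) = #{r_b > x} / B of the simulated pivot draws."""
    r = sorted(pivot_draws)
    b = len(r)
    k = math.floor(alpha * b + 1e-9)  # largest count with count / B <= alpha
    m = math.ceil(alpha * b - 1e-9)   # smallest count with count / B >= alpha
    c_lower = r[b - k - 1]            # inf{x: J(x) <= alpha}
    c_upper = r[b - m]                # sup{x: J(x) >= alpha}
    return c_lower, c_upper


def in_confidence_region(pivot_value, critical_value):
    """A candidate parameter value is retained when its pivot does not exceed c."""
    return pivot_value <= critical_value
```

The two values differ only when `alpha * B` is an integer; any number between them is an admissible critical value in the sense of the text.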

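The power of the test $\psi_n$ against the alternative $H_{n,\xi}$,


(1.10)    $\beta_{n,\psi}(\alpha; \xi, \theta_A) = P^n_{\xi,\theta_A}[R_n(X, \xi_0) > c_n(\alpha; \hat\xi_n, \hat\theta_n)]$,


can be estimated by $\beta_{n,\psi}(\alpha; \xi, \hat\theta_n)$. Let $K_{n,R}(\xi, \theta)$ denote $\mathcal{L}[R_n(X, \xi_0) \mid P^n_{\xi,\theta}]$ and
let $K_{n,R}(x; \xi, \theta)$ be the associated survival function. An alternative estimate for
$\beta_{n,\psi}(\alpha; \xi, \theta_A)$ is $K_{n,R}[c_n(\alpha; \xi, \hat\theta_n); \xi, \hat\theta_n]$. Note that the critical value $c_n(\alpha; \xi, \hat\theta_n)$
appearing in this power estimate varies with $\xi$. Evaluation of
$K_{n,R}[c_n(\alpha; \xi, \hat\theta_n); \xi, \hat\theta_n]$ generally requires Monte Carlo simulation of $J_{n,R}(\xi, \hat\theta_n)$,
to obtain $c_n(\alpha; \xi, \hat\theta_n)$, and of $K_{n,R}(\xi, \hat\theta_n)$. Evaluation of $\beta_{n,\psi}(\alpha; \xi, \hat\theta_n)$ typically
requires a nested two-stage Monte Carlo simulation. The convergence of
$\beta_{n,\psi}(\alpha; \xi, \hat\theta_n)$ and of $K_{n,R}[c_n(\alpha; \xi, \hat\theta_n); \xi, \hat\theta_n]$ to $\beta_{n,\psi}(\alpha; \xi, \theta_A)$, uniformly in $\xi$, is
the subject of Theorems 3.2 and 3.3.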


REMARK. The results in this paper do not contradict those of Bahadur and
Savage (1956) because level and power are defined over a smaller model here. The
level of $\varphi_n$ is $E_{\xi_0,\theta_A}(\varphi_n)$ for us, but is $\sup_\theta E_{\xi_0,\theta}(\varphi_n)$ for Bahadur and Savage.
Similarly, the power of $\varphi_n$ is $E_{\xi,\theta_A}(\varphi_n)$ for us rather than $\sup_\theta E_{\xi,\theta}(\varphi_n)$.
Considering test performance only over the distributions in the family $\{P^n_{\xi,\theta_A}: \xi \in \Xi\}$ is
reasonable because the actual distribution $P^n_{\xi_A,\theta_A}$ of the sample falls within this
family.


2. The test statistic approach: asymptotics and examples. Do the tests
$\varphi_n$ and $\psi_n$ defined in Section 1 have approximate level $\alpha$? Are the associated
power function estimates consistent, uniformly in $\xi$? This section addresses these
questions for the test $\varphi_n$; Section 3 does the same for $\psi_n$. The notation of Section
1 is retained throughout. All theorem proofs are deferred to Section 4.

The first theorem gives sufficient conditions under which the asymptotic level
of $\varphi_n$ is $\alpha$.


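THEOREM 2.1. Suppose the following requirements are met:


A.1. $\lim_{n\to\infty} P^n_{\xi_0,\theta_A}[m_2(\hat\theta_n, \theta_A) > \varepsilon] = 0$ for every positive $\varepsilon$.
A.2. If $\{\theta_n\}$ is any sequence such that $\{(\xi_0, \theta_n) \in \Omega\}$ and $\lim_{n\to\infty}\theta_n = \theta_A$, then
$K_{n,T}(\xi_0, \theta_n)$ converges weakly to a unique limit distribution $K_T(\xi_0, \theta_A)$.


Let $K_T(x; \xi_0, \theta_A)$ be the survival function associated with $K_T(\xi_0, \theta_A)$ and let

(2.1)    $d_L(\alpha; \xi_0, \theta_A) = \inf\{x: K_T(x; \xi_0, \theta_A) \le \alpha\}$,
         $d_U(\alpha; \xi_0, \theta_A) = \sup\{x: K_T(x; \xi_0, \theta_A) \ge \alpha\}$.

Then

(2.2)    $K_T[d_U(\alpha; \xi_0, \theta_A); \xi_0, \theta_A] \le \liminf_{n\to\infty} \beta_{n,\varphi}(\alpha; \xi_0, \theta_A)$
         $\le \limsup_{n\to\infty} \beta_{n,\varphi}(\alpha; \xi_0, \theta_A)$
         $\le K_T[d_L(\alpha; \xi_0, \theta_A)-; \xi_0, \theta_A]$.

If $K_T(x; \xi_0, \theta_A)$ is continuous in x, then

(2.3)    $\lim_{n\to\infty} \beta_{n,\varphi}(\alpha; \xi_0, \theta_A) = \alpha$.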


If the convergence in A.1 is uniform over $m_2$-compacts of $\Theta$, so is the
convergence (2.3) to asymptotic level $\alpha$ [cf. the derivation of (4.6) in the proof of
Theorem 2.2]. Each example in this section exhibits this type of uniform
convergence.


EXAMPLE 1: Minimum distance tests. Suppose $\{F_\theta: \theta \in \Theta\}$ is a parametric
family of c.d.f.'s on the real line, $\Theta$ being an open subset of $R^k$. The null
hypothesis $H_{n,0}$ asserts that the observed random vectors $\{X_i; 1 \le i \le n\}$ are
i.i.d. with c.d.f. $F_{\theta_A}$, the value of $\theta_A$ being unknown. Let $\hat F_n$ be the empirical c.d.f.
and let $\|\cdot\|$ denote supremum norm. Consider the test which rejects $H_{n,0}$ if the
statistic $T_n = n^{1/2}\inf_{\theta\in\Theta}\|\hat F_n - F_\theta\|$ is sufficiently large. Bootstrap critical values
for the test can be found as in (1.6), by identifying $P^n_{\xi_0,\theta}$ with the distribution
determined by $F_\theta$; the definition of $\xi_0$ is arbitrary here.
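
To make the statistic concrete, the sketch below evaluates $T_n$ for one particular family that is an assumption of this illustration only: the $N(\theta, 1)$ location family, with the infimum over $\theta$ approximated on a grid spanning the sample range. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def sup_distance(sorted_x, theta):
    """sup_x |F_hat_n(x) - F_theta(x)| for the N(theta, 1) c.d.f.
    The supremum is attained at, or just before, an order statistic."""
    n = len(sorted_x)
    cdf = NormalDist(mu=theta, sigma=1.0).cdf
    d = 0.0
    for i, v in enumerate(sorted_x, start=1):
        f = cdf(v)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def min_distance_statistic(x, grid_size=400):
    """T_n = sqrt(n) * min over a theta-grid of the sup-norm distance."""
    xs = sorted(x)
    lo, hi = xs[0], xs[-1]
    grid = [lo + (hi - lo) * j / grid_size for j in range(grid_size + 1)]
    return math.sqrt(len(xs)) * min(sup_distance(xs, th) for th in grid)
```

A grid search is used only for simplicity; any one-dimensional minimizer could replace it, and a multiparameter family would need a different optimizer.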

Conditions A.1, A.2 for Theorem 2.1 will be verified under the following
assumptions on the parametric model, which are made for every $\theta_A \in \Theta$:

Identifiability. For every neighborhood N of $\theta_A$, $\inf\{\|F_\theta - F_{\theta_A}\|: \theta \notin N\} > 0$.

Continuous norm differentiability. There exists a k × 1 vector function $\eta_{\theta_A}$
such that the components of $\eta_{\theta_A}$ are bounded, $\|F_\theta - F_{\theta_A} - (\theta - \theta_A)'\eta_{\theta_A}\| =
o(|\theta - \theta_A|)$, and $\lim_{\theta\to\theta_A}\|\eta_\theta - \eta_{\theta_A}\| = 0$.

Nonsingularity. The components of $\eta_{\theta_A}$ are linearly independent.




Let G be a c.d.f. on the real line. By an argument similar to that in Pollard
(1980),


(2.4)    $\inf_{\theta\in\Theta}\|G - F_\theta\| = \inf_{t\in R^k}\|G - F_{\theta_A} - t'\eta_{\theta_A}\| + o(\|G - F_{\theta_A}\|)$


as $\|G - F_{\theta_A}\|$ tends to zero.

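A.1. Let $m_2$ be euclidean metric on $R^k$ and let $\hat\theta_n$ be a minimum distance
estimate of $\theta_A$, satisfying the requirement $\|\hat F_n - F_{\hat\theta_n}\| \le \inf_{\theta\in\Theta}\|\hat F_n - F_\theta\| + n^{-1}$.
It is well known that this choice of $\hat\theta_n$ is consistent under $F_{\theta_A}$.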

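A.2. Let $\{\theta_n \in \Theta\}$ be any sequence which converges to $\theta_A$. Let $B_n$ denote the
empirical Brownian bridge based on n i.i.d. random variables which are uniformly
distributed on (0,1). Let B denote the Brownian bridge process. Since
$\mathcal{L}[n^{1/2}(\hat F_n - F_{\theta_n}) \mid P^n_{\xi_0,\theta_n}] = \mathcal{L}[B_n \circ F_{\theta_n}]$, it follows from (2.4) and the
assumptions on the parametric model that $K_{n,T}(\xi_0, \theta_n) = \mathcal{L}[n^{1/2}\inf_{\theta\in\Theta}\|\hat F_n -
F_\theta\| \mid P^n_{\xi_0,\theta_n}]$ converges weakly to $K_T(\xi_0, \theta_A) = \mathcal{L}[\inf_t\|B \circ F_{\theta_A} + t'\eta_{\theta_A}\|]$.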

Thus, the bootstrap test based on the minimum distance statistic $T_n$ has
approximate size $\alpha$, in the sense of (2.2).


The next theorem establishes uniform consistency of the two power function
estimates for $\varphi_n$. A key assumption is finite-dimensionality of the parameter $\xi$.
Conditions B.2 and B.3 in the statement of Theorem 2.2 imply the weaker
conditions A.1 and A.2 used in Theorem 2.1.


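THEOREM 2.2. Suppose $\Xi$ is $R^k$ and the following requirements are met:

B.1. $P^\infty_{\xi_A,\theta_A}[\lim_{n\to\infty}\hat\theta_n = \theta_A] = 1$.

B.2. $\lim_{n\to\infty}\sup_\xi P^n_{\xi,\theta_A}[m_2(\hat\theta_n, \theta_A) > \varepsilon] = 0$ for every positive $\varepsilon$.

B.3. If $\{(\xi_n, \theta_n) \in \Omega; n \ge 1\}$ is any sequence such that $\lim_{n\to\infty}n^{1/2}(\xi_n - \xi_0) = h$
for some $h \in R^k$ and $\lim_{n\to\infty}\theta_n = \theta_A$, then $K_{n,T}(\xi_n, \theta_n) \Rightarrow K_T(h; \xi_0, \theta_A)$, a
limit distribution which is continuous and does not depend upon the
particular sequence $\{(\xi_n, \theta_n)\}$ chosen. Moreover, $K_T(\xi_0, \theta_A)$ has a strictly
monotone survival function.

B.4. If $\{(\xi_n, \theta_n) \in \Omega; n \ge 1\}$ is any sequence such that $\lim_{n\to\infty}n^{1/2}|\xi_n - \xi_0| = \infty$
and $\lim_{n\to\infty}\theta_n = \theta_A$, then $\lim_{n\to\infty}K_{n,T}(x; \xi_n, \theta_n) = 1$ for every finite
real x.


Then

(2.5)    $P^\infty_{\xi_A,\theta_A}\bigl[\lim_{n\to\infty}\sup_\xi |K_{n,T}[d_n(\alpha; \xi_0, \hat\theta_n); \xi, \hat\theta_n] - \beta_{n,\varphi}(\alpha; \xi, \theta_A)| = 0\bigr] = 1$.


Suppose B.2 is strengthened to


B.2'. If $\{(\xi_n, \theta_n) \in \Omega; n \ge 1\}$ is any sequence such that $\lim_{n\to\infty}\theta_n = \theta_A$, then
$\lim_{n\to\infty}P^n_{\xi_n,\theta_n}[m_2(\hat\theta_n, \theta_A) > \varepsilon] = 0$ for every positive $\varepsilon$.


Then also


(2.6)    $P^\infty_{\xi_A,\theta_A}\bigl[\lim_{n\to\infty}\sup_\xi |\beta_{n,\varphi}(\alpha; \xi, \hat\theta_n) - \beta_{n,\varphi}(\alpha; \xi, \theta_A)| = 0\bigr] = 1$.

The suprema in (2.5) and (2.6) are taken over $\{\xi \in \Xi: (\xi, \theta_A) \in \Omega$ and $(\xi, \hat\theta_n) \in
\Omega\}$. Similarly, the supremum in B.2 is taken over $\{\xi \in \Xi: (\xi, \theta_A) \in \Omega\}$.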


EXAMPLE 2: Testing a mean. Suppose the $\{X_i; 1 \le i \le n\}$ are i.i.d. random
k × 1 vectors with c.d.f. $G_A(x - \xi_A)$, where $G_A$ has mean zero and nonsingular
covariance matrix $\Sigma(G_A)$. Both $\xi_A$ and $G_A$ are unknown. Consider testing the
null hypothesis $\xi_A = \xi_0$ versus the alternatives $\xi_A \ne \xi_0$, the c.d.f. $G_A$ being
regarded as a fixed unknown nuisance parameter. The test statistic to be used is
$T_n = |S_n^{-1/2}n^{1/2}(\bar X_n - \xi_0)|$, where $|\cdot|$ is any norm on $R^k$ and $\bar X_n$, $S_n$ are the
sample mean and sample covariance matrix respectively. When the norm $|\cdot|$ is
euclidean, $T_n^2$ is a multiple of Hotelling's $T^2$-statistic.
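
The relation to Hotelling's statistic follows by squaring the euclidean norm. The display below assumes $S_n^{-1/2}$ denotes the symmetric inverse square root of $S_n$; the remark on the divisor is an added observation, since the text does not say which divisor $S_n$ uses.

```latex
T_n^2 \;=\; \bigl|S_n^{-1/2}\, n^{1/2}(\bar X_n - \xi_0)\bigr|_2^2
      \;=\; n\,(\bar X_n - \xi_0)'\, S_n^{-1}\, (\bar X_n - \xi_0).
% With divisor n - 1 in S_n this is exactly Hotelling's T^2;
% with divisor n it equals \frac{n}{n-1}\, T^2.
```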

Let $\delta_L$ denote Lévy metric and estimate $G_A$ by $\hat G_n$, the empirical c.d.f. of
the residuals $\{X_i - \bar X_n; 1 \le i \le n\}$. For any k × k matrix A, let $|A| =
\sup\{|Ax|: |x| = 1\}$. Define the metric $m_2$ by

(2.7)    $m_2(G_n, G_A) = \delta_L(G_n, G_A) + |\Sigma(G_n) - \Sigma(G_A)|$.

We will verify that conditions B.1, B.2', B.3, and B.4 are satisfied in this example.
Consequently, the bootstrap test $\varphi_n$ defined in (1.6) has asymptotic size $\alpha$
(Theorem 2.1) and both bootstrap power function estimates are uniformly
consistent (Theorem 2.2).

B.1. Without loss of generality, because of location invariance, take $\xi_A = 0$.
Let $\hat F_n$ be the empirical c.d.f. of the $\{X_i; 1 \le i \le n\}$. With $P^\infty_{0,G_A}$-probability one,
$\delta_L[\hat G_n(x), G_A(x + \bar X_n)] \le \|\hat F_n - G_A\| \to 0$ by the vector version of the
Glivenko-Cantelli theorem; and $\delta_L[G_A(x + \bar X_n), G_A(x)] \to 0$ and $\Sigma(\hat G_n) \to \Sigma(G_A)$
by the strong law of large numbers. Condition B.1 follows, in view of the
definition (2.7) for $m_2$.

B.2'. Let $G_n$ converge to $G_A$ in the metric $m_2$ and, without loss of generality,
take $\xi_n = 0$. In $P^n_{0,G_n}$-probability, $\delta_L[\hat G_n(x), G_n(x + \bar X_n)] \le \|\hat F_n - G_n\| \to 0$ by
the Dvoretzky-Kiefer-Wolfowitz inequality; and $\delta_L[G_n(x + \bar X_n), G_A(x)] \to 0$
and $\Sigma(\hat G_n) \to \Sigma(G_A)$ by a triangular array version of Khintchine's weak law of
large numbers.

B.3. Let $\{(\xi_n, G_n)\}$ be any sequence such that $\lim_{n\to\infty}n^{1/2}(\xi_n - \xi_0) = h$ and
$\lim_{n\to\infty}m_2(G_n, G_A) = 0$. The latter convergence is equivalent to saying $G_n \Rightarrow G_A$
and $\Sigma(G_n) \to \Sigma(G_A)$. Then $K_{n,T}(\xi_n, G_n) \Rightarrow \mathcal{L}[|Z + \Sigma^{-1/2}(G_A)h|] =
K_T(h; \xi_0, G_A)$, where Z is a k × 1 normal random vector with mean zero and
identity covariance matrix. Indeed, $\mathcal{L}[n^{1/2}(\bar X_n - \xi_0) \mid P^n_{\xi_n,G_n}] \Rightarrow N(h, \Sigma(G_A))$ by
the Lindeberg central limit theorem; and $S_n$ converges in $P^n_{\xi_n,G_n}$-probability to
$\Sigma(G_A)$, by Khintchine's weak law of large numbers. The limit distribution
$K_T(h; \xi_0, G_A)$ does not depend on $\xi_0$ in this example because of location
invariance. The limit distribution is continuous because the set $\{z \in R^k: |z| = c\}$ has
Lebesgue measure zero for every norm $|\cdot|$ on $R^k$. Thus B.3 holds.

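B.4. Let $\{(\xi_n, G_n)\}$ be such that $\lim_{n\to\infty}n^{1/2}|\xi_n - \xi_0| = \infty$ and
$\lim_{n\to\infty}m_2(G_n, G_A) = 0$. The inequality $|x| \le |A| \cdot |A^{-1}x|$ implies


(2.8)    $T_n \ge |S_n^{-1/2}n^{1/2}(\xi_n - \xi_0)| - |S_n^{-1/2}n^{1/2}(\bar X_n - \xi_n)|$
         $\ge |S_n^{1/2}|^{-1}n^{1/2}|\xi_n - \xi_0| - |S_n^{-1/2}n^{1/2}(\bar X_n - \xi_n)|$.

As in the previous paragraph, $\mathcal{L}[S_n^{-1/2}n^{1/2}(\bar X_n - \xi_n) \mid P^n_{\xi_n,G_n}] \Rightarrow \mathcal{L}(Z)$ and $S_n \to
\Sigma(G_A)$ in $P^n_{\xi_n,G_n}$-probability. In view of (2.8), condition B.4 holds.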

A numerical study of the univariate bootstrap t-test and of the two associated
power function estimates yielded some additional information. In this study, the
data was taken to be normally distributed with unit variance. The null
hypothesis value of $\xi$ was $\xi_0 = 0$; the nominal test level $\alpha = 0.05$; and the sample
size n = 20. Table 1 compares the power function of the bootstrap t-test with the
power function of the classical t-test under standard normal shift alternatives.
The critical value of the bootstrap test was obtained from 200 bootstrap samples.
The power of the bootstrap t-test was approximated by Monte Carlo simulation,
using 1000 standard normal samples. Even at sample size 20, the power function
of the bootstrap test is almost indistinguishable from that of the classical t-test.
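
A comparison of this kind could be set up along the following lines. The sketch is not the code used for Table 1; it assumes the statistic $T_n = n^{1/2}|\bar X_n|/s_n$, residual resampling for the bootstrap critical value, and the tabulated two-sided 5% point 2.093 of Student's t with 19 degrees of freedom, which ties the classical comparison to n = 20 and $\alpha = 0.05$.

```python
import math
import random
import statistics

T_CRIT_19 = 2.093  # two-sided 5% point of Student's t, 19 d.f. (n = 20 only)


def t_stat(x, xi0=0.0):
    """T_n = sqrt(n) * |mean(x) - xi0| / sd(x); infinite if the sample is constant."""
    s = statistics.stdev(x)
    if s == 0:
        return math.inf
    return math.sqrt(len(x)) * abs(statistics.fmean(x) - xi0) / s


def bootstrap_critical_value(x, alpha, n_boot, rng, xi0=0.0):
    """Upper-alpha quantile of T_n over resamples of the centered residuals."""
    m = statistics.fmean(x)
    res = [v - m for v in x]
    stats = sorted(t_stat([xi0 + rng.choice(res) for _ in res], xi0)
                   for _ in range(n_boot))
    return stats[min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)]


def compare_power(xi, n=20, alpha=0.05, n_boot=200, n_mc=1000, seed=0):
    """Monte Carlo rejection rates of the bootstrap and classical t-tests
    when the data are N(xi, 1) and the null value is 0."""
    rng = random.Random(seed)
    boot_rej = classical_rej = 0
    for _ in range(n_mc):
        x = [rng.gauss(xi, 1.0) for _ in range(n)]
        t = t_stat(x)
        boot_rej += t > bootstrap_critical_value(x, alpha, n_boot, rng)
        classical_rej += t > T_CRIT_19
    return boot_rej / n_mc, classical_rej / n_mc
```

Calling `compare_power(xi)` over a grid of shifts gives two simulated power curves for comparison; the defaults mirror the sample size and simulation sizes quoted above.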

For a single standard normal sample of size 20, Table 2 records the two power
function estimates described earlier in this example and the normal
approximation to the power function based on the sample standard deviation. The
calculation of $\beta_{n,\varphi}(\alpha; \xi, \hat G_n)$ used 200 bootstrap samples for the critical value loop and
1000 bootstrap samples for the outer loop. The calculation of
$K_{n,T}[d_n(\alpha; \xi_0, \hat G_n); \xi, \hat G_n]$ used 1000 bootstrap samples for both the critical value
and for $K_{n,T}(\xi, \hat G_n)$. Two points stand out:


(a) The three power function estimates in Table 2 are roughly similar, especially
when $|\xi|$ is near zero or is large, even though the second estimate is more
asymmetric.

(b) Each of the estimated power functions usually underestimates the actual 
power function reported in Table 1. 


Point (a) is not surprising, since each of the power function estimates converges
uniformly in $\xi$ to the actual power function as n increases. Point (b) is
attributable to the particular N(0,1) sample of size 20 from which the power function
estimates were computed; the estimated standard deviation of this sample
happened to be 1.140, which is larger than the population standard deviation.
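
The effect of an inflated standard deviation estimate is easy to see in a normal approximation. The formula below is one common form of such an approximation for a two-sided test and is an assumption of this sketch; the text does not state the exact formula behind Table 2.

```python
import math
from statistics import NormalDist


def normal_power_approx(xi, s, n, xi0=0.0, alpha=0.05):
    """Approximate power of a two-sided level-alpha test of xi = xi0,
    treating sqrt(n) * (mean - xi0) / s as N(delta, 1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    delta = math.sqrt(n) * (xi - xi0) / s
    phi = NormalDist().cdf
    return 1 - phi(z - delta) + phi(-z - delta)


# A larger estimated standard deviation shrinks delta and hence the power.
power_with_s_hat = normal_power_approx(0.5, 1.140, 20)
power_with_sigma = normal_power_approx(0.5, 1.0, 20)
```

With s = 1.140 in place of 1, the noncentrality is divided by 1.140, so the approximate power at any fixed nonzero shift is lower, in line with point (b).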


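EXAMPLE 3: Testing correlation. Suppose the $\{X_i; 1 \le i \le n\}$ are i.i.d. 2 × 1
random vectors with c.d.f. $G_A[D^{-1}(\xi_A, \sigma_{1A}, \sigma_{2A})x]$, where


(2.9)    $D(\xi, \sigma_1, \sigma_2) = \begin{pmatrix} \sigma_1 & 0 \\ \xi\sigma_2 & (1 - \xi^2)^{1/2}\sigma_2 \end{pmatrix}$


and $G_A$ has mean zero, identity covariance matrix, and finite fourth moments.
The covariance matrix of each $X_i$ is therefore


(2.10)    $\Sigma(\xi_A, \sigma_{1A}, \sigma_{2A}) = \begin{pmatrix} \sigma_{1A}^2 & \xi_A\sigma_{1A}\sigma_{2A} \\ \xi_A\sigma_{1A}\sigma_{2A} & \sigma_{2A}^2 \end{pmatrix}$.
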
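
As a check on (2.9) and (2.10), with $D(\xi, \sigma_1, \sigma_2)$ read as the lower triangular matrix with rows $(\sigma_1, 0)$ and $(\xi\sigma_2, (1-\xi^2)^{1/2}\sigma_2)$, and with identity covariance for $G_A$, the covariance matrix of $X_i$ is $DD'$:

```latex
D D' =
\begin{pmatrix} \sigma_1 & 0 \\ \xi\sigma_2 & (1-\xi^2)^{1/2}\sigma_2 \end{pmatrix}
\begin{pmatrix} \sigma_1 & \xi\sigma_2 \\ 0 & (1-\xi^2)^{1/2}\sigma_2 \end{pmatrix}
=
\begin{pmatrix} \sigma_1^2 & \xi\sigma_1\sigma_2 \\
                \xi\sigma_1\sigma_2 & \xi^2\sigma_2^2 + (1-\xi^2)\sigma_2^2 \end{pmatrix}
=
\begin{pmatrix} \sigma_1^2 & \xi\sigma_1\sigma_2 \\
                \xi\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.
```

The off-diagonal entry divided by $\sigma_1\sigma_2$ is $\xi$, so $\xi$ is indeed the correlation.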
The values of the correlation $\xi_A \in (-1, 1)$, of the standard deviations $\sigma_{1A} > 0$,
$\sigma_{2A} > 0$, and of the c.d.f. $G_A$ are unknown. Consider testing the null hypothesis
$\xi_A = \xi_0$ versus alternatives $\xi_A \ne \xi_0$, with $\theta_A = (\sigma_{1A}, \sigma_{2A}, G_A)$ regarded as a fixed
unknown nuisance parameter. The test statistic to be used is $T_n = n^{1/2}|\hat\xi_n - \xi_0|$,
where $\hat\xi_n$ is the sample correlation.

Let


(2.11)    $S_n = \begin{pmatrix} s_{n,1}^2 & s_{n,12} \\ s_{n,12} & s_{n,2}^2 \end{pmatrix}$


be the sample covariance matrix. Take $\hat G_n(x) = \hat F_n[D(\hat\xi_n, s_{n,1}, s_{n,2})x]$ as the
estimate of $G_A$ and let $\hat\theta_n = (s_{n,1}, s_{n,2}, \hat G_n)$. Suppose that $\mu_{r_1,r_2}(G_A) =
\int x_1^{r_1}x_2^{r_2}\,dG_A(x_1, x_2)$ is finite whenever $r_1$, $r_2$ are nonnegative integers such that
$r_1 + r_2 \le 4$. Define the metric $m_2$ by


(2.12)    $m_2(\theta_n, \theta_A) = \delta_L(G_n, G_A) + \sum_{r_1+r_2\le 4}|\mu_{r_1,r_2}(G_n) - \mu_{r_1,r_2}(G_A)|$.

We will verify that conditions B.1, B.2', B.3, and B.4 are satisfied in this example,
under the additional assumption that $\Xi = [-b, b]$ for some b in (0,1).

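B.2'. Let $\{\xi_n\}$ be any sequence in $[-b, b]$ and let $\{\theta_n = (\sigma_{n,1}, \sigma_{n,2}, G_n)\}$ be any
sequence which converges to $\theta_A$. Suppose B.2' does not hold. By going to a
subsequence, we can assume that, for some positive $\varepsilon$, the sequence
$\{P^n_{\xi_n,\theta_n}[m_2(\hat\theta_n, \theta_A) > \varepsilon]\}$ remains bounded away from zero and $\lim_{n\to\infty}\xi_n = \xi^* \in
[-b, b]$. From Khintchine's weak law of large numbers, $\hat\xi_n \to \xi^*$ and $s_{n,i} \to \sigma_{iA}$
in $P^n_{\xi_n,\theta_n}$-probability. By the definition of $\hat G_n$,


(2.13)    $\delta_L(\hat G_n, G_A) \le \delta_L\{\hat F_n[D(\hat\xi_n, s_{n,1}, s_{n,2})x],\ G_n[D^{-1}(\xi_n, \sigma_{n,1}, \sigma_{n,2})D(\hat\xi_n, s_{n,1}, s_{n,2})x]\}$
          $+\ \delta_L\{G_n[D^{-1}(\xi_n, \sigma_{n,1}, \sigma_{n,2})D(\hat\xi_n, s_{n,1}, s_{n,2})x],\ G_A(x)\}$.


The first term on the right side of (2.13) is bounded above by $\|\hat F_n(x) -
G_n[D^{-1}(\xi_n, \sigma_{n,1}, \sigma_{n,2})x]\|$, which converges to zero in $P^n_{\xi_n,\theta_n}$-probability. The
second term on the right side of (2.13) also tends to zero in $P^n_{\xi_n,\theta_n}$-probability
because both D and $D^{-1}$ are continuous at $(\xi^*, \sigma_{1A}, \sigma_{2A})$. Moreover, $\mu_{r_1,r_2}(\hat G_n) \to
\mu_{r_1,r_2}(G_A)$ in $P^n_{\xi_n,\theta_n}$-probability because $\mu_{r_1,r_2}(\hat F_n)$ converges to the corresponding
moment of $P_{\xi_n,\theta_n}$, by the weak law of large numbers, and because
$D^{-1}(\hat\xi_n, s_{n,1}, s_{n,2}) \to D^{-1}(\xi^*, \sigma_{1A}, \sigma_{2A})$. It follows that $m_2(\hat\theta_n, \theta_A) \to 0$ in
$P^n_{\xi_n,\theta_n}$-probability, contradicting the supposition to the contrary made at the start
of this argument.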
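B.1. The verification of B.1 is similar to that of B.2'. Note that, with
$P^\infty_{\xi_A,\theta_A}$-probability one, $\lim_{n\to\infty}\|\hat F_n(x) - G_A[D^{-1}(\xi_A, \sigma_{1A}, \sigma_{2A})x]\| = 0$, by the
Glivenko-Cantelli theorem for two-dimensional c.d.f.'s, and


$\lim_{n\to\infty}(\hat\xi_n, s_{n,1}, s_{n,2}) = (\xi_A, \sigma_{1A}, \sigma_{2A})$

by the strong law of large numbers.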
B.3. Let $\{(\xi_n, \theta_n)\}$ be any sequence such that $\lim_{n\to\infty}n^{1/2}(\xi_n - \xi_0) = h$ and
$\lim_{n\to\infty}\theta_n = \theta_A$. The latter convergence is equivalent to saying $G_n \Rightarrow G_A$ and
$\mu_{r_1,r_2}(G_n) \to \mu_{r_1,r_2}(G_A)$ whenever $r_1 + r_2 \le 4$. In particular, $\lim_{n\to\infty}\sigma_{n,i} = \sigma_{iA}$ for
i = 1, 2. By the Lindeberg central limit theorem, $\mathcal{L}[n^{1/2}\{S_n -
\Sigma(\xi_n, \sigma_{n,1}, \sigma_{n,2})\} \mid P^n_{\xi_n,\theta_n}]$ converges weakly to a singular normal distribution with
mean zero and covariance structure depending only on $(\xi_0, \theta_A)$. Since
$\hat\xi_n = s_{n,12}/(s_{n,1}s_{n,2})$ is a continuously differentiable function of $S_n$,
$\mathcal{L}[n^{1/2}(\hat\xi_n - \xi_n) \mid P^n_{\xi_n,\theta_n}]$ converges to a normal distribution with mean zero and
variance $\sigma^2(\xi_0, \theta_A)$, say. The weak limit of $\{K_{n,T}(\xi_n, \theta_n)\}$ is therefore
$\mathcal{L}[|\sigma(\xi_0, \theta_A)Z + h|]$, Z being a N(0,1) random variable. This limit distribution
fulfils the requirements of B.3.

B.4. Let $\{(\xi_n, \theta_n) \in \Omega\}$ be any sequence such that $\lim_{n\to\infty}n^{1/2}|\xi_n - \xi_0| = \infty$
and $\lim_{n\to\infty}\theta_n = \theta_A$. Then $\lim_{n\to\infty}K_{n,T}(x; \xi_n, \theta_n) = 1$ for every finite real x.
If not, by going to a subsequence, we can assume that, for some positive x,
$K_{n,T}(x; \xi_n, \theta_n)$ does not converge to one while $\lim_{n\to\infty}\xi_n = \xi^* \in [-b, b]$. From
the argument in the previous paragraph, $\mathcal{L}[n^{1/2}(\hat\xi_n - \xi_n) \mid P^n_{\xi_n,\theta_n}] \Rightarrow
N(0, \sigma^2(\xi^*, \theta_A))$. Since


(2.14)    $T_n \ge n^{1/2}|\xi_n - \xi_0| - n^{1/2}|\hat\xi_n - \xi_n|$,

it follows that $\lim_{n\to\infty}K_{n,T}(x; \xi_n, \theta_n) = 1$ for every positive x, contradicting the
initial assumption to the contrary and thereby establishing B.4.


The conditions for Theorem 2.2 are adapted to euclidean $\Xi$ and to local
asymptotic power calculations. The next theorem identifies more general
circumstances under which the two bootstrap power function estimates
approximate $\beta_{n,\varphi}(\alpha; \xi, \theta_A)$. Let $\delta_P$ denote the Prohorov metric on the extended real
line $\bar R$.


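THEOREM 2.3. Suppose the following requirement is met in addition to
conditions B.1 and B.2 of Theorem 2.2:


C.1. If $\{(\xi_n, \theta_n) \in \Omega; n \ge 1\}$ is any sequence such that $\lim_{n\to\infty}\theta_n = \theta_A$, then
$\lim_{n\to\infty}\delta_P[K_{n,T}(\xi_n, \theta_n), K_{n,T}(\xi_n, \theta_A)] = 0$.


Then the following assertion is true with $P^\infty_{\xi_A,\theta_A}$-probability one: for every
sufficiently small positive $\varepsilon$, there exists $n_0(\varepsilon)$ such that for every $n \ge n_0(\varepsilon)$,


(2.15)    $K_{n,T}[d_{n,U}(\alpha - \varepsilon; \xi_0, \hat\theta_n) + \varepsilon; \xi, \hat\theta_n] - \varepsilon$
          $\le \beta_{n,\varphi}(\alpha; \xi, \theta_A) \le K_{n,T}[d_{n,L}(\alpha + \varepsilon; \xi_0, \hat\theta_n) - \varepsilon; \xi, \hat\theta_n] + \varepsilon$

simultaneously for every possible $\xi$. Suppose B.2 is strengthened to B.2'. Let

(2.16)    $\beta_{n,\varphi,L}(\alpha; \xi, \theta, \varepsilon) = P^n_{\xi,\theta}[T_n > d_{n,U}(\alpha - \varepsilon; \xi_0, \hat\theta_n) + \varepsilon] - \varepsilon$,
          $\beta_{n,\varphi,U}(\alpha; \xi, \theta, \varepsilon) = P^n_{\xi,\theta}[T_n > d_{n,L}(\alpha + \varepsilon; \xi_0, \hat\theta_n) - \varepsilon] + \varepsilon$.

Then (2.15) may be replaced by

(2.17)    $\beta_{n,\varphi,L}(\alpha; \xi, \hat\theta_n, \varepsilon) \le \beta_{n,\varphi}(\alpha; \xi, \theta_A) \le \beta_{n,\varphi,U}(\alpha; \xi, \hat\theta_n, \varepsilon)$.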


This theorem asserts that certain small perturbations of either bootstrap
power function estimate will bracket $\beta_{n,\varphi}(\alpha; \xi, \theta_A)$, uniformly in $\xi$, provided the
sample size n is large enough. While the conclusions are weaker than those of
Theorem 2.2, Theorem 2.3 can be applied to examples where the parameter $\xi$ is
infinite dimensional.

EXAMPLE 4: Testing for symmetry. Suppose the $\{X_i; 1 \le i \le n\}$ are i.i.d.
random variables with unknown c.d.f. $F_A$. Set $\xi_A(x) = \frac{1}{2}[F_A(x) + F_A(-x) - 1]$
and $\theta_A(x) = \frac{1}{2}[F_A(x) - F_A(-x) + 1]$. Then $\xi_A(x) = \xi_A(-x)$,
$\theta_A(x) + \theta_A(-x) = 1$, $\theta_A$ is the c.d.f. of a random variable distributed
symmetrically about the origin, and $F_A(x) = \xi_A(x) + \theta_A(x)$. Consider the problem of
testing the null hypothesis $\xi_A = 0$ (that is, the null hypothesis that $X_i$ is
symmetrically distributed about the origin) versus the alternatives $\xi_A \ne 0$. The
nuisance parameter $\theta_A$ is unknown but fixed by the experiment.

The space $\Theta$ consists of all c.d.f.s associated with random variables
symmetrically distributed about the origin. The space $\Xi$ consists of all real-valued
functions on R such that $\xi(x) = \xi(-x)$ and $\|\xi\| \le 2^{-1}$. Let


(2.18)    $\hat\xi_n(x) = \frac{1}{2}[\hat F_n(x) + \hat F_n(-x) - 1]$,
          $\hat\theta_n(x) = \frac{1}{2}[\hat F_n(x) - \hat F_n(-x) + 1]$.


The test statistic to be used is $T_n = n^{1/2}\|\hat\xi_n\|$. We will show that the conditions
for Theorems 2.1 and 2.3 hold when the metric $m_2$ on $\Theta$ is defined by

(2.19)    $m_2(\theta_n, \theta_A) = \|\theta_n - \theta_A\|$.
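
The statistic $T_n = n^{1/2}\sup_x|\frac{1}{2}[\hat F_n(x) + \hat F_n(-x) - 1]|$ can be computed exactly, because the function inside the supremum is piecewise constant with breaks only at the points $\pm X_i$. The sketch below (hypothetical function name) evaluates it at every break point and at one point inside every interval between consecutive break points.

```python
import bisect
import math


def symmetry_statistic(x):
    """T_n = sqrt(n) * sup_x |0.5 * (F_hat(x) + F_hat(-x) - 1)|."""
    xs = sorted(x)
    n = len(xs)

    def ecdf(t):
        # right-continuous empirical c.d.f.
        return bisect.bisect_right(xs, t) / n

    def xi_hat(t):
        return 0.5 * (ecdf(t) + ecdf(-t) - 1.0)

    # The function is constant between consecutive break points, and it is
    # zero beyond the extreme break points, so these points suffice.
    breaks = sorted(set(xs) | {-v for v in xs})
    points = list(breaks)
    points += [(a + b) / 2 for a, b in zip(breaks, breaks[1:])]
    return math.sqrt(n) * max(abs(xi_hat(t)) for t in points)
```

For a sample lying entirely on one side of the origin the supremum reaches its largest possible value 1/2, so $T_n = n^{1/2}/2$.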

B.1. By Glivenko-Cantelli, $\lim_{n\to\infty}\|\hat F_n - F_A\| = 0$ with $P^\infty_{\xi_A,\theta_A}$-probability one.
This convergence and the definitions of $\hat\theta_n$, $\theta_A$ imply B.1.

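B.2' and A.1. Let $\{(\xi_n, \theta_n) \in \Omega\}$ be any sequence such that $\lim_{n\to\infty}\theta_n = \theta_A$.
Under $P^n_{\xi_n,\theta_n}$, the empirical c.d.f. $\hat F_n$ may be represented as $F_{\xi_n,\theta_n} + n^{-1/2}B_n \circ F_{\xi_n,\theta_n}$,
where $F_{\xi,\theta}(x) = \xi(x) + \theta(x)$ and $B_n$ is the empirical Brownian bridge process.
Thus, from (2.18),


(2.20)    $\hat\theta_n(x) = \theta_n(x) + \frac{1}{2}n^{-1/2}[B_n \circ F_{\xi_n,\theta_n}(x) - B_n \circ F_{\xi_n,\theta_n}(-x)]$,


which implies B.2' and therefore A.1.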
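C.1. Let $\{(\xi_n, \theta_n) \in \Omega\}$ be any sequence such that $\lim_{n\to\infty}\theta_n = \theta_A$. Let


(2.21)    $S_n(\xi, \theta) = \|\frac{1}{2}[B_n \circ F_{\xi,\theta}(x) + B_n \circ F_{\xi,\theta}(-x)] + n^{1/2}\xi(x)\|$.


Since $T_n = \|n^{1/2}(\hat\xi_n - \xi_n) + n^{1/2}\xi_n\|$, it follows from (2.18) that $K_{n,T}(\xi_n, \theta_n) =
\mathcal{L}[S_n(\xi_n, \theta_n)]$ and $K_{n,T}(\xi_n, \theta_A) = \mathcal{L}[S_n(\xi_n, \theta_A)]$. By Skorokhod's theorem, there
exist versions of $\{B_n\}$ and of the Brownian bridge B such that $\lim_{n\to\infty}\|B_n - B\| =
0$ w.p. 1 and B has uniformly continuous sample paths. Since $\|F_{\xi_n,\theta_n} - F_{\xi_n,\theta_A}\| =
\|\theta_n - \theta_A\| \to 0$, the corresponding versions of $\{S_n(\xi_n, \theta_n)\}$ and $\{S_n(\xi_n, \theta_A)\}$ have
the property that $\lim_{n\to\infty}|S_n(\xi_n, \theta_A) - S_n(\xi_n, \theta_n)| = 0$ w.p. 1. This convergence
implies C.1.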

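A.2. Let $\{\theta_n \in \Theta\}$ be any sequence such that $\lim_{n\to\infty}\theta_n = \theta_A$. From the
previous paragraph, $K_{n,T}(0, \theta_n) \Rightarrow \mathcal{L}[\|\frac{1}{2}[B \circ \theta_A(x) + B \circ \theta_A(-x)]\|]$, a limit law
satisfying the requirements of A.2.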




EXAMPLE 5: Testing for multivariate normality. Suppose the $\{X_i; 1 \le i \le n\}$
are i.i.d. p × 1 random vectors with c.d.f. $G_A[\Sigma_A^{-1/2}(x - \mu_A)]$, where $\mu_A$ is a
p × 1 vector, $\Sigma_A$ is a p × p positive definite symmetric matrix, and $G_A$ is a
continuous c.d.f. on $R^p$ with mean zero, identity covariance matrix, and finite
fourth moments. Identify $\xi_A$ with $G_A$ and $\theta_A$ with $(\mu_A, \Sigma_A)$. Consider the
problem of testing the null hypothesis $G_A = \Phi$, where $\Phi$ is the standard normal
c.d.f. on $R^p$, versus alternatives $G_A \ne \Phi$. The nuisance parameters $(\mu_A, \Sigma_A)$ are
unknown but fixed by the experiment.

Let $\bar X_n$, $S_n$ be the sample mean vector and sample covariance matrix,
respectively. Set $\hat\theta_n = (\bar X_n, S_n)$ and $\hat G_n(x) = \hat F_n(S_n^{1/2}x + \bar X_n)$, where $\hat F_n$ is the empirical
c.d.f. The test statistic to be considered is

(2.22)    $T_n = n^{1/2}\|\hat F_n(x) - \Phi[S_n^{-1/2}(x - \bar X_n)]\| = n^{1/2}\|\hat G_n - \Phi\|$.


The space $\Theta$ consists of all pairs $(\mu, \Sigma)$ such that $\mu$ is a p × 1 vector and $\Sigma$ is a
p × p positive definite symmetric matrix. Let $m_2$ be euclidean metric on $\Theta$. For
any c.d.f. G on $R^p$, let $\mu(r_1, r_2, \ldots, r_p; G) = \int x_1^{r_1}\cdots x_p^{r_p}\,dG(x_1, \ldots, x_p)$. Define
the metric $m_1$ by


(2.23)    $m_1(G_n, G_A) = \|G_n - G_A\| + \sum|\mu(r_1, \ldots, r_p; G_n) - \mu(r_1, \ldots, r_p; G_A)|$,


the sum being taken over all sets of $\{r_j; 1 \le j \le p\}$ such that every $r_j$ is a
nonnegative integer and $\sum_{j=1}^p r_j = 4$. Let $\Xi$ be any set of continuous c.d.f.s G on
$R^p$ such that the mean of G is zero, the covariance matrix of G is the identity
matrix, $\Xi$ contains $\Phi$ and $G_A$, and $\Xi$ is compact in the metric $m_1$. Then, the
conditions for Theorems 2.1 and 2.3 hold.

B.1. By the strong law of large numbers, $\hat\theta_n = (\bar X_n, S_n)$ converges with
$P^\infty_{\xi_A,\theta_A}$-probability one to $\theta_A = (\mu_A, \Sigma_A)$ in the metric $m_2$.

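B.2' and A.1. Let $\{(G_n, \theta_n) \in \Omega\}$, where $\theta_n = (\mu_n, \Sigma_n)$, be any sequence such
that $\lim_{n\to\infty}\mu_n = \mu_A$ and $\lim_{n\to\infty}\Sigma_n = \Sigma_A$. By Khintchine's weak law of large
numbers, $\hat\theta_n \to \theta_A$ in $P^n_{G_n,\theta_n}$-probability. This implies both B.2' and A.1.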

C.1. Let $\{(G_n, \theta_n) \in \Omega\}$ be any sequence such that $\lim_{n\to\infty}\theta_n = \theta_A$. Since $\Xi$ is
compact, we may assume without loss of generality that $G_n$ converges to some
c.d.f. $G^* \in \Xi$ in the metric $m_1$. Evidently


(2.24)    $K_{n,T}(G_n, \theta_n) = \mathcal{L}\bigl[\,\|n^{1/2}[\hat F_n(\Sigma_n^{1/2}x + \mu_n) - G_n(x)]$
          $+\ n^{1/2}[\Phi(x) - \Phi\{S_n^{-1/2}(\Sigma_n^{1/2}x + \mu_n - \bar X_n)\}]$
          $+\ n^{1/2}[G_n(x) - \Phi(x)]\| \bigm| P^n_{G_n,\theta_n}\bigr]$.


Standard weak convergence arguments based on the Lindeberg central limit
theorem establish the following fact: there exists a gaussian process $W(G^*, \theta_A)$
such that

$\mathcal{L}\bigl[n^{1/2}[\hat F_n(\Sigma_n^{1/2}x + \mu_n) - G_n(x)]$
$+\ n^{1/2}[\Phi(x) - \Phi\{S_n^{-1/2}(\Sigma_n^{1/2}x + \mu_n - \bar X_n)\}] \bigm| P^n_{G_n,\theta_n}\bigr]$

converges weakly to $\mathcal{L}[W(G^*, \theta_A)]$ on $C(R^p)$. Consequently,

(2.25)    $\lim_{n\to\infty}\delta_P\{K_{n,T}(G_n, \theta_n), \mathcal{L}[\|W(G^*, \theta_A) + n^{1/2}(G_n - \Phi)\|]\} = 0$.

Since (2.25) remains true if $\theta_n$ is replaced by $\theta_A$, the fulfilment of condition C.1
follows.


3. The confidence region approach: asymptotics and examples. The
main difference between the tests $\varphi_n$ and $\psi_n$, defined in (1.6) and (1.9),
respectively, lies in their critical values. The critical value of $\varphi_n$ depends on $\hat\theta_n$ and the
null hypothesis value $\xi_0$, while the critical value of $\psi_n$ depends on estimates $\hat\xi_n$
and $\hat\theta_n$ of both parameters. As a result, the asymptotic theory for $\psi_n$ described in
this section requires conditions on $\hat\xi_n$ as well as $\hat\theta_n$.


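THEOREM 3.1. Suppose the following requirements are met:

D.1. $\lim_{n\to\infty}P^n_{\xi_0,\theta_A}[m_1(\hat\xi_n, \xi_0) > \varepsilon] = \lim_{n\to\infty}P^n_{\xi_0,\theta_A}[m_2(\hat\theta_n, \theta_A) > \varepsilon] = 0$ for
every positive $\varepsilon$.

D.2. If $\{(\xi_n, \theta_n) \in \Omega\}$ is any sequence such that $\lim_{n\to\infty}\xi_n = \xi_0$ and
$\lim_{n\to\infty}\theta_n = \theta_A$, then $J_{n,R}(\xi_n, \theta_n)$ converges weakly to a unique limit
distribution $J_R(\xi_0, \theta_A)$.

Let $J_R(x; \xi_0, \theta_A)$ be the survival function associated with $J_R(\xi_0, \theta_A)$ and let


(3.1)    $c_L(\alpha; \xi_0, \theta_A) = \inf\{x: J_R(x; \xi_0, \theta_A) \le \alpha\}$,
         $c_U(\alpha; \xi_0, \theta_A) = \sup\{x: J_R(x; \xi_0, \theta_A) \ge \alpha\}$.

Then

(3.2)    $J_R[c_U(\alpha; \xi_0, \theta_A); \xi_0, \theta_A] \le \liminf_{n\to\infty}\beta_{n,\psi}(\alpha; \xi_0, \theta_A)$
         $\le \limsup_{n\to\infty}\beta_{n,\psi}(\alpha; \xi_0, \theta_A)$
         $\le J_R[c_L(\alpha; \xi_0, \theta_A)-; \xi_0, \theta_A]$.

If $J_R(x; \xi_0, \theta_A)$ is continuous in x, then

(3.3)    $\lim_{n\to\infty}\beta_{n,\psi}(\alpha; \xi_0, \theta_A) = \alpha$.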


EXAMPLE 6: Testing mean orientation. Let $\xi = (\xi_1, \xi_2, \xi_3)'$ be a unit vector
in $R^3$ other than the vector $e = (1, 0, 0)'$. Let $|\cdot|_2$ denote euclidean metric. Define
the orthogonal matrix


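(3.4)    $O(\xi) = \begin{pmatrix} \xi_1 & \xi_2 & \xi_3 \\ 0 & (1 - \xi_1^2)^{-1/2}\xi_3 & -(1 - \xi_1^2)^{-1/2}\xi_2 \\ -(1 - \xi_1^2)^{1/2} & (1 - \xi_1^2)^{-1/2}\xi_1\xi_2 & (1 - \xi_1^2)^{-1/2}\xi_1\xi_3 \end{pmatrix}$,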
noting that $O(\xi)\xi = e$. Suppose the $\{X_i; 1 \le i \le n\}$ are i.i.d. random unit vectors
with c.d.f. $G_A[O(\xi_A)x]$, where $G_A$ is the c.d.f. of a singular distribution in $R^3$
whose support is the unit sphere and whose normalized mean vector
$E_{G_A}(X)/|E_{G_A}(X)|_2 = e$. Then the mean orientation of $X_i$ is $E(X_i)/|E(X_i)|_2 = \xi_A$.
Consider testing the null hypothesis $\xi_A = \xi_0 \ne e$, with $G_A$ being regarded as a
fixed unknown nuisance parameter. (A different choice of the orthogonal matrix
$O(\xi)$ handles the case $\xi_A = e$.) Let $\hat\xi_n = \bar X_n/|\bar X_n|_2$, the unit vector in the direction
of the sample resultant. The pivot to be used is $R_n(X, \xi) = 2^{-1}n|\hat\xi_n - \xi|_2^2 =
n(1 - \hat\xi_n'\xi)$.
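
The two forms of the pivot agree because both $\hat\xi_n$ and $\xi$ are unit vectors; the short derivation is:

```latex
|\hat\xi_n - \xi|_2^2
  = |\hat\xi_n|_2^2 - 2\,\hat\xi_n'\xi + |\xi|_2^2
  = 2 - 2\,\hat\xi_n'\xi,
\qquad\text{so}\qquad
2^{-1} n\, |\hat\xi_n - \xi|_2^2 = n\,(1 - \hat\xi_n'\xi).
```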

The space $\Xi = \{x \in R^3: |x|_2 = 1\} - \{e\}$ while $\Theta$ consists of all c.d.f.'s supported
on the unit sphere whose mean orientation vector is e. Take $\hat G_n(x) = \hat F_n[O'(\hat\xi_n)x]$,
which is the empirical c.d.f. of the rotated sample $\{O(\hat\xi_n)X_i; 1 \le i \le n\}$. Let $m_1$
and $m_2$ be, respectively, euclidean metric and Lévy metric. In this example, $G_A$
and $\hat G_n$ play the roles of $\theta_A$ and $\hat\theta_n$, respectively.

D.1. The weak law of large numbers and continuity of $O(\xi)$ at $\xi = \xi_0$ imply
that condition D.1 holds.

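D.2. Let $\{(\xi_n, G_n) \in \Omega\}$ be any sequence such that $\lim_{n\to\infty}\xi_n = \xi_0$ and $G_n \Rightarrow
G_A$. Let $F_{\xi,G}(x) = G[O(\xi)x]$ and designate the mean vector and covariance
matrix of $F_{\xi,G}$ by $\mu(\xi, G)$ and $\Sigma(\xi, G)$, respectively. Since $F_{\xi_n,G_n} \Rightarrow F_{\xi_0,G_A}$ and
$O(\xi)$ is continuous at $\xi = \xi_0$,


$\mathcal{L}[n^{1/2}(\bar X_n - \mu(\xi_n, G_n)) \mid P^n_{\xi_n,G_n}] \Rightarrow N(0, \Sigma(\xi_0, G_A))$.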


Thus, J, En Oa) = Lin, — £,)|PZ, a] converges weakly to a limit distri- 
bution which is normal with mean zero and covariance matrix depending on £, 
and G,. Consequently, D.2 holds. 

Because $P_{\hat\xi_n,\hat G_n}$ is the empirical distribution of the sample in this example, the
reparametrization by $\xi$ and G is not necessary in constructing the bootstrap
confidence region which generates the test. (For instance, Theorem 1 in Beran
(1984) could be applied.) However, the $(\xi, G)$-parametrization is useful in
estimating the power of the test, a question to which we return after the next theorem.


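THEOREM 3.2. Suppose $\Xi$ is $R^k$ and the following requirements are met in
addition to conditions B.1 and B.2 of Theorem 2.2:


E.1. $\lim_{n\to\infty}\sup_{n^{1/2}|\xi - \xi_0| \le c}P^n_{\xi,\theta_A}[|\hat\xi_n - \xi| > \varepsilon] = 0$ for every positive $\varepsilon$ and c.

E.2. If $\{(\xi_n, \theta_n) \in \Omega\}$ is any sequence such that $\lim_{n\to\infty}n^{1/2}(\xi_n - \xi_0) = h$ for
some $h \in R^k$ and $\lim_{n\to\infty}\theta_n = \theta_A$, then $K_{n,R}(\xi_n, \theta_n) \Rightarrow K_R(h; \xi_0, \theta_A)$ and
$J_{n,R}(\xi_n, \theta_n) \Rightarrow J_R(\xi_0, \theta_A)$. Both limit distributions are continuous and do not
depend upon the particular sequence $\{(\xi_n, \theta_n)\}$ chosen. Moreover, $J_R(\xi_0, \theta_A)$
has a strictly monotone survival function.

E.3. If $\{(\xi_n, \theta_n) \in \Omega\}$ is any sequence such that $\lim_{n\to\infty}n^{1/2}|\xi_n - \xi_0| = \infty$ and
$\lim_{n\to\infty}\theta_n = \theta_A$, then $\lim_{n\to\infty}K_{n,R}(x; \xi_n, \theta_n) = 1$ for every finite real x and
the distributions $\{J_{n,R}(\xi_n, \theta_n); n \ge 1\}$ are tight.

Then


(3.5)    $P^\infty_{\xi_A,\theta_A}\bigl[\lim_{n\to\infty}\sup_\xi|K_{n,R}[c_n(\alpha; \xi, \hat\theta_n); \xi, \hat\theta_n] - \beta_{n,\psi}(\alpha; \xi, \theta_A)| = 0\bigr] = 1$.


Suppose B.2 and E.1 are strengthened respectively to condition B.2' of Theorem
2.2 and to


E.1'. If $\{(\xi_n, \theta_n) \in \Omega\}$ is any sequence such that $\sup_n n^{1/2}|\xi_n - \xi_0| \le c$ and
$\lim_{n\to\infty}\theta_n = \theta_A$, then $\lim_{n\to\infty}P^n_{\xi_n,\theta_n}[|\hat\xi_n - \xi_0| > \varepsilon] = 0$ for every positive $\varepsilon$
and c.

Then also


(3.6)    $P^\infty_{\xi_A,\theta_A}\bigl[\lim_{n\to\infty}\sup_\xi|\beta_{n,\psi}(\alpha; \xi, \hat\theta_n) - \beta_{n,\psi}(\alpha; \xi, \theta_A)| = 0\bigr] = 1$.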


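EXAMPLE 2 continued. Consider the pivot $R_n(X, \xi) = |S_n^{-1/2}n^{1/2}(\bar X_n - \xi)|$.
The test statistic $T_n$ discussed earlier in this example coincides with $R_n(X, \xi_0)$.
Because of location invariance,


$J_{n,R}(\xi, \theta) = \mathcal{L}[|S_n^{-1/2}n^{1/2}\bar X_n| \mid P^n_{0,\theta}] = K_{n,T}(\xi_0, \theta)$.


Thus, $c_n(\alpha; \xi, \theta)$ does not depend on $\xi$ and coincides with $d_n(\alpha; \xi_0, \theta)$. The tests
$\varphi_n$ and $\psi_n$ are identical. Since $K_{n,T}(\xi, \theta) = K_{n,R}(\xi, \theta)$, the bootstrap power
estimates $K_{n,T}[d_n(\alpha; \xi_0, \hat\theta_n); \xi, \hat\theta_n]$ and $K_{n,R}[c_n(\alpha; \xi, \hat\theta_n); \xi, \hat\theta_n]$ coincide, as do
the estimates $\beta_{n,\varphi}(\alpha; \xi, \hat\theta_n)$ and $\beta_{n,\psi}(\alpha; \xi, \hat\theta_n)$. Theorem 3.2 is not needed for this
example.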


EXAMPLE 3 continued. Consider the pivot $R_n(X, \xi) = n^{1/2}|\hat\xi_n - \xi|$, where $\hat\xi_n$
is the sample correlation. Retaining definitions made in the earlier discussion of
this example, we will show that conditions D.2, E.1', E.2 and E.3 of Theorems 3.1
and 3.2 hold. Condition D.1 for Theorem 3.1 follows from B.2 and E.1.

E.1'. If $\{(\xi_n, \theta_n) \in \Omega\}$ is any sequence which converges to $(\xi_0, \theta_A)$, $\hat\xi_n$
converges in $P^n_{\xi_n,\theta_n}$-probability to $\xi_0$, by Khintchine's weak law of large numbers
applied separately to the sample covariance and the two sample variances. In
particular, E.1' holds.

E.2. Let $\{(\xi_n, \theta_n) \in \Omega\}$ be any sequence such that $\lim_{n\to\infty}n^{1/2}(\xi_n - \xi_0) = h$
and $\lim_{n\to\infty}\theta_n = \theta_A$. Since $R_n(X, \xi_0)$ equals the test statistic $T_n$ discussed earlier
in the example, B.3 implies the desired weak convergence for $\{K_{n,R}(\xi_n, \theta_n)\}$. The
argument for B.3 also shows that $J_{n,R}(\xi_n, \theta_n) \Rightarrow \mathcal{L}[|\sigma(\xi_0, \theta_A)Z|]$, where Z is a
N(0, 1) random variable.

D.2 is checked like the second part of E.2. 

E.3. Let $\{(\xi_n, \theta_n) \in \Omega\}$ be such that $\lim_{n\to\infty}n^{1/2}|\xi_n - \xi_0| = \infty$ and $\lim_{n\to\infty}\theta_n = \theta_A$.
Condition B.4 implies the first part of E.3 because $R_n(X, \xi_0) = T_n$. Suppose the
distributions $\{J_{n,R}(\xi_n, \theta_n); n \ge 1\}$ were not tight, hence not relatively compact.
By going to a subsequence, we can assume that the $\{J_{n,R}(\xi_n, \theta_n)\}$ do not
converge to a limit distribution while $\lim_{n\to\infty}\xi_n = \xi^* \in [-b, b]$ exists. On the
other hand, $\{J_{n,R}(\xi_n, \theta_n)\}$ converges weakly to $\mathcal{L}[|\sigma(\xi^*, \theta_A)Z|]$ by the Lindeberg
central limit theorem, as in B.3. The contradiction establishes E.3.


EXAMPLE 6 continued. Let $\Xi$ be any compact subset of $\{x \in R^3: |x|_2 = 1\} - \{e\}$.
Retain other definitions made in the earlier discussion of this example. An
argument similar to the one used in Example 3 shows that the conditions for
Theorem 3.2 are satisfied in this case as well.
THEOREM 3.3. Suppose the following requirements are met in addition to
conditions B.1 and B.2 of Theorem 2.2: 


F.1. lim, ., Sup, Pi", [m (Ên, £) > e] = 0 for every positive e. 

F.2. If {(€,,8,) €E 9} is any sequence such that lim, ...0, = 94, then 
lim, Spl Kn, Ren 0n), K,, Ren 6,)] = 0. 

F.3. If {(&,,8,) €Q} and {(Èn 0n) EQ} are any two sequences such that 

lim, obn = lim,..9, =_9, and lim, _,.,m,(é,,§,) = 0, then 


lim p + Spl In, Èn On)» In, Ens 9,)] = O. 
Then the following assertion is true with PẸ 4 -probability one: For every 
sufficiently small positive e, there exists n,(e) such that for every n > n,(e), 
Ky, plcn wo — e; & 6,) + e; £, 6,] — e 
(3.7) < Bn las È, 4) 
< Kp rlCa la teg, Â) — 6 8, Â] +e 


simultaneously for every possible ¢. Suppose B2 and F.1 are strengthened 
respectively to condition B.2’ of Theorem 2.2 and to 


EV. If {(€,,6,) E Q} is any sequence such that lim 6, =6,, then 


ARO OR 


lim, PE o [mEn n) > e] = 0 for every positive e and c. 
Let 
ag Ba, y, 1605 £, 0, €) = PZ o [R A(X, £0) > Cn, ula ~ £; Ên Ôn) + €] — e, 
Bn, y, u%3 £, 0, €) = Po [ R(X, £o) > cn sla + e; Èn Ân) — e] + e. 
Then (3.7) may be replaced by 
(3.9) Bn, o, i105 E, 8n €) < Bn, las È, 84) < Br, y, was È, Ôn, 2). 


EXAMPLE 4 continued. Consider the pivot R(X, £) = n'/\&, — él, where 
Ê~ ¢ are defined by (2.18) and the surrounding discussion. Let both m, and m, be 
supremum norm metric. The conditions for Theorem 3.1 and 3.3 hold for the 
following reasons: 

F.Y and D.1. Let {(,,8,) E Q} be any sequence such that lim, sbn = ba- 
Under P? a, Ê, has the representation 


(3.10) a(x) =é,(x) + in| B,- Fp o(x) + Ba Fy o(-*)], 


which implies F.1’. Moreover, F.1’ and B.?’ (verified earlier) imply D.1. 

F.2. Since K,, ,(£,9) = Kn r($, 0) in this example, condition F.2 coincides 
with the previously verified condition C.1. 

F.3. Let {(€,,9,) E Q} and {(€,,6,) E Q} be any two sequences such that 


lim „bn = lim „> obn = a and lim,- olfr = gall = 0. Let 
(3.11) V,(é, 9) > 4B, - Fy o(x) + B, F; (x)|. 




In view of (3.10), J, ($, 8) = £LV,(é, 0)]. Let {B,} be versions of the empirical 
Brownian bridge processes which converge almost surely to a Brownian bridge 
we B. Since lim, .. ollF:,.6, ~ Feall = 0, the corresponding versions of 
(V;(En.9,)} and {V,(E,,6,)} have the property that lim, ...[V,(E,.,) — 
ve 6,)] = 0 with probability one. This implies F.3. 


_ D2. Let {(£,,8,) EQ} be any sequence such that lim,- én = o and 
lim „ofn = 94. By the argument for F.3, J, (&,, 9,) converges weakly to a limit 
distribution 
Info: ba) S g ||B i F,,, 9,(%) +B- F,.a,(~=) ||] 
as required. 


4. Theorem proofs.

PROOF OF THEOREM 2.1. Similar to Theorem 1 in Beran (1984). 

PROOF OF THEOREM 2.2. Let $\{\theta_n \in \Theta;\ n \ge 1\}$ be any sequence which converges
to $\theta_A$. It follows from B.3, specialized to the sequence $\{(\xi_0, \theta_n)\}$, that
(4.1)   $\lim_{n\to\infty} d_n(\alpha; \xi_0, \theta_n) = d(\alpha; \xi_0, \theta_A),$
the upper $\alpha$-point of $K(\xi_0, \theta_A)$.


Let {(&,, On) © Q; n = 1} be any sequence such that lim „on (En — Eo) =A 
for some h € R* and {6,} converges to 6,. From B.3 and (4.1), 


(4.2) dim K, rld,(a; fos On); Ens On] = KEP [d(a; Eo, b4); Eos bal- 

On the other hand, it follows from B.2 and (4.1) that {d,(a; £, ĝ„)} converges in 
PE e probability to d(a; £o, 84). From this and B.3, 

(4.3) Jim Bn, 9 5 on 9,4) = Ki [d(a; fo, 94); os 8a] . 

Combining (4.2) and (4.3) yields 

(4.4) lim sup [Ky r[d,,(05 £o 0n); E, 0n] — Bn, gla; E, 84)| = 0 


mee mA- tse 
for every positive c and for every sequence {6,} converging to 04. 

Let {(&,,9,) € Q; n 2 1} be any sequence such that ee nig, — |= 
and {8,} converges to 6,. The critical values {d,(a; £o, ĝ,)} still Seats in 
ie a, probability to d(a; £o, 04) because of B.2 and (4.1). In view of B.4, both 

lim y+ obn, (8; Én 04) and lim, 0 n, rl d,(a; o 8, ns Sia 6 Al are equal to 1. This 
fact, together with (4.4), implies that 


(4.5) im sup |Kn,rlda(a £o,9,)3&, On] = Bn, o(a; £, 84)| =0 


for every sequence {@,} converging to @,. The theorem assertion (2.5) follows from 
(4.5) and B.1. 
Let $\{(\xi_n, \theta_n) \in \Omega;\ n \ge 1\}$ be any sequence such that $\{\theta_n\}$ converges to $\theta_A$.
From B.2’ and (4.1), it follows that the critical values {d,(a; €o, 6.)} converge in 
Pf o probability to d(a; £5, 84). Thus, by essentially the same arguments as 
above, 


(4.6) Tim Bn, g( 45 Ens 0n) = K [d(as Eo, 84); Eo» 4] 
if 
Jim ni/(£,— £o) = h; and Tim Bp, ola; Ens Oa) = Lif lim mZ, ~ Eol = 0. 
Consequently, 
(4.7) jim, sup | Bn, plas & 6) = Bn, ola; È 64)| = 0, 


which implies (2.6), in view of B.1. 


PROOF OF THEOREM 2.3. Fix $\alpha \in (0,1)$. Without loss of generality, assume
that the test statistic $T_n$ takes its values in [0,1]; if necessary, replace $T_n$ by a
strictly monotone function of $T_n$ to achieve this end. Let $\{\theta_n\}$, $\{\theta_n'\}$ be any two
sequences in $\Theta$ such that $\lim_{n\to\infty}\theta_n = \lim_{n\to\infty}\theta_n' = \theta_A$. First we show that, for
every sufficiently small positive $\varepsilon$, there exists $n_0(\varepsilon)$ such that
dp Lla + £; &,6,) —e<d, (a; o n) 
(4.8) < dn, (a5 Èo, 8n) 
< da ula — £; $o 0n) +e 


for every n > n,(e). Indeed, from the definition (1.4) of the upper and lower 
a-points, 


Ka rldn, ula; fo 6,,) + &/2; £o, 6, <a 
$ K„rlda, 1( a3 fos 6,,) a £€/2; bo. 6, | 


for all sufficiently large n. In view of C.1 and the compact support of T, 
lim ô [Kn (0: 6n) Kn, r(€o, 9,)] = 0, where 5, denotes Lévy metric. Hence, 


Ka rld, Aa; Eos On) + €/25 £o, On] 

> Karl dn, ula; bo, 0n) + £; bo, 0n] — € 
for all sufficiently large n. Combining (4.9) with (4.10) yields 
(4.11) K, rldn, A fos On) + £; os Oa] <a +e, 


which implies d, (a; $o, 6n) + € 2 dn, fat 6; $o, 6,) and therefore the right 
half of (4.8). The left half of (4.8) is proved similarly. 

Let {&,,} be any sequence in = such that both {(£,, 6,)} € Q and {(&,, 8,)} E Q. 
We prove next that, for every sufficiently small positive e, there exists n,({e) such 


(4.9) 


(4.10) 
that for every $n \ge n_0(\varepsilon)$
Karl En, ula ~ 8 Eos 9) + & Ens 9] — e 
ae < By, ol 3 Ens ba) < Kar lan, ila + E; Eos Oa) — E; ns 94] + e. 
Let 
A,(e) = {dn (a + E; &,9,) — E 
(4.18) < dn, s(a; o, 8n) < dn, ala; Eo» Ôn) 
<d, ula — E; Eos Oa) + e}. 


Since (6,} converges to 6, in P? » -probability by B.2, there exist versions of 
the (6} such that lim ĝ = 6, w.p. 1 [Wichura (1970)]. For these versions, 


Pilimint, ,..A,(e)] = 1 because of (4.8). Hence lim, ,,,P[A,(e)] = 1, which 
implies 
(4.14) lim PE o {[A,(e)] =] 
noo 

for the original estimates { 6}. Consequently, 
Peg [Tn > da, „(@; £0, 6,)] 

2 Kn rldn ule — €) Éo, 6,) + E; ae 8a] + o(1), 
PE 6, [T,, > dp, (a; bo, 6,)] 

< Kn rlda, (a te; bo, 6,) TE; Suis JN + o(1) 
as n tends to infinity. The definition (1.6) of ẹ„, implies that £, ,(a; En, 84) lies 
between the two probabilities on the left side of (4.15). Hence (4.12) follows. 

The third step is to show that @, can be replaced by 9, in the two bounds in 

(4.12), Since lim „ôL Kn 7(En: bn) Kn, T(En 9)] = 0, by C.1 and the compact 
support of T,, 
K,7ldn, ha TE; os 0n) +e; Sns 6,] 

2. K, ridn fa — 8 fo, bn) + 2e; Ei 6, — €, 
K,rldn, AC: + £; os 6,) — E; fy 6,| 

s Kn rldn, (a + E; os On) — 28; Ens On] + E. 


Combining (4.16) with (4.12) and using the monotonicity of d, ,(a; &, 8), 
dn, <a; §,@) in a yields the following conclusion: For every sufficiently small 
positive e, there exists n,(e) such that for every n > n(e) 


K, rida, Aa — &; $o 0n) T £; $, bn] TE 
s Baol $, 0a) s Kn rldn sla +e; Šo» ba) — 8 $, 8n] +e 


simultaneously for every possible ¢. The theorem assertion (2.15) follows from 
(4.17) and condition B.1. 


(4.15) 


(4.16) 


(4.17) 
Finally, we show that for every sufficiently small positive $\varepsilon$, there exists $n_0(\varepsilon)$
such that for every n > n,(e), 
(4.18) Bn, P, (a; En» 4, n? e) s Bn, 9% a 84) <s Baso. AG; P 6,5 e). 
Let 
a(€) = {d,,r(a + 2e; $o, 8.) TES d,,1(a + E; os On) 
< dp, sla — e; osn) < dn, „(a — 2e; £o» 6) +e}. 


Under condition B.2’, lim „— Pgo, [B,(e)] = 1, by an argument based on (4.8) 
and similar to that for (4.14). Thus, 


Kn rld, zla +e; os ba) — 8 Ens 6,| 

< Pf a [Tn > dy, (a + 2e; £o, Ân) — 2e; Ens On) + o(1), 
Kyla, A@ — E; Eo, ba) +e; Ens A 

> PP o [Ta > dp a — 2e; £o, n) + 28; Ens On] + o(1) 


as n tends to infinity. Combining (4.20) with (4.17) yields (4.18). The theorem 
assertion (2.17) follows from (4.18) and condition B.1. 


(4.19) 


(4.20) 


PROOF OF THEOREM 3.1. Essentially the same as Theorem 1 in Beran (1984). 


PROOF OF THEOREM 3.2. Let {(§,,6,) E Q} be any sequence such that 
lim , aon (En — 9) = h € R* and lim, „b, = b4. From E.2, 


(4.21) Jim c (a; $n 0a) = cla; £o, 94) 


the upper a-point of J(£,, 84). Moreover, in view of E.1 and B.2, {c,(a; Ên, 6,)} 
converges in Př , probability to e(a; $o, 84). It follows from these two conver- 
gences and E.2 that 


(4.22) lim sup |K,, nlen(as £, 0n); €, Onl — Ba ola; E 8) | = 0 
n=O ABE tals 
for every positive c; the argument parallels that for (4.4). 

If {(§,, 0n) €E Q} is any sequence such that lim „bn = 9,, conditions E.2 and 
E.3 imply that the distributions {dJ, p(é,,9,)} are tight; Te 
sup,lc„(a; £n» 6,)| is finite for every a € (0, 1). Suppose lim, — .2/7|£, — Eol = 
It follows from E.3 and the above that lim, wK n, rlen(a; Ens On) Ens Ond = z 
Since R, > œ and ĝ, >b, in PL 6, -probability, there exists versions 
of En h» R,,)} such that these convergences occur with probability one. For 
these versions, sup,|c,(a; Èn ê .)| is finite with probability one. Hence 
lim „p obn, yf 05 Emba) =L 

From (4.4) and the previous paragraph, 


(4.23) im sup| Kn, nlen(as å, 6,)3 £; 6,.)] = Bn, (0; £, 64)| =0 


for every sequence {6,} converging to @,. Theorem assertion (3.5) follows from 
(4.23) and B.1. If conditions B.2 and E.1 are strengthened to B.2′ and E.1′, then
6, may be replaced by @, in (4.23). Consequently, 


(4.24) tim me Bn, (0; Ens 0,.).= Baol 0 £, 6,)| = 
which implies (3.6), in view of B.1. 


PROOF oF THEOREM 3.3. The argument is a modification of the proof for 
Theorem 2.3. Without loss of generality, assume that the pivot F(X, £) takes its 
values in [0, 1] for every ¢. Let {(€,,, 8„)} and En 6,)} be any two sequences in Q 
such that lim, 0n = lim, ,,.8, = 9, and lim, ..m,(é,,&,) =0. Then, for 


every sufficiently small positive e, thers exists no(£) such that 
Cp, (a + £; $i ba) — e< ca, (a; En 8) 
(4.25) < Cp, gla; Èn On) 
S Cp, „(a —e&,,0,) +8 
for every n = n,(e). This follows from F.3 by an argument similar to that for 
(4.8). 

Conditions B.2, F.1, and the reasoning for (4.12) now imply: for every suffi- 
ciently small positive e, there exists 7,(e) such that 
K,,, nln, uf TE; 53 0n) + £; ts 6,] 7E 

Bn, (0; is 04) <s K,„ rlen (a + £; Eis bn) Te; En 6,| +e 
for every n = n,(e). 

A further argument, drawing on F.2 and similar to that for (4.17), shows that 
6, can be replaced by @, in the two bounds in (4.26). Hence, for every sufficiently 
small positive e, there exists ,(e) such that for every n = n,(e) 

Kn plen, „(a — &; £, 6,) +e; $, bal 7E 
< Baala; $, 0a) < Kn rlen (a +e; £,6,) — eé, 0] +e 


The theorem assertion (3.7) follows from (4.27) and B.1. 
Let 


(4.26) 


(4.27) 


C,(e) = {cn pa + 2e; )-ex<c, (ate; Ens On) 
(4.28) <s Cn, ula jj Ens ba) 
S Cp, {a — = En Â) +e}. 


Under conditions B.2’ and F.1’, lim, .. „Pgo, [C,(e)] = 1 for every positive suffi- 
ciently small e; the argument is based upon (4.25) and is similar to that for (4.14). 
Continuing along the lines of the last paragraph in the proof of Theorem 2.3 
yields (3.9). 
ASYMPTOTICS FOR CONFIGURAL LOCATION ESTIMATORS


By STEPHAN MORGENTHALER 
Yale University 


This paper examines the asymptotic properties of compromise estima- 
tors. By this we mean an estimation method which compromises between a 
finite number of sampling situations in a small sample optimal way. We 
develop the asymptotic theory of such estimators in the location problem and 
show that under a specific choice of a pair of sampling situations the 
compromise estimator is asymptotically robust in Huber's sense. 


1. Introduction. Configural polysampling denotes a method of estimation 
which is geared to small sample sizes and produces “robust” methods [see 
Pregibon and Tukey (1981)]. It differs in important ways from the robustness
philosophy developed by Huber (1964). Since in small samples the distributions
of estimators are quite intractable, one has to rely on numerical methods in
order to evaluate even relatively simple performance summaries such as the
mean-square-error. This holds true except in some simple cases, such as the
Gaussian sampling model, where a few expectations can be evaluated in closed
form. In this connection, it is important for the statistical community to realize
that numerical methods are perfectly acceptable. They do, however, limit the 
number of sampling situations we can take into consideration. This is in contrast 
to an asymptotic approach, where, for simple models, an infinity of sampling 
situations can be considered simultaneously [Huber (1964)]. 

Pitman (1939), for example, solves the small sample problem for a single 
sampling situation in a location and scale setting. In this paper we will show 
what happens if Pitman’s method is extended to two sampling situations with 
known scale, and we will address the question of the asymptotic performance of
such estimators.

An asymptotic analysis is the simplest way to learn something about the 
behavior of an estimator in a variety of sampling situations. But it only gives a 
partial answer and we should not forget the more important approach based on 
performing “experiments” for small sample sizes. This paper, however, will 
restrict attention to asymptotic discussions. 

In Section 2 we will introduce the idea of compromise estimators and discuss 
their optimality properties. Section 3 contains the corresponding asymptotic 
theory. As an example we define a compromise estimator which is asymptotically 
everywhere at least as good as Huber’s minimax estimator. 
2. Configural estimators.


2.1. Pitman’s estimator. Let $x_1, x_2, \dots, x_n$ be n independent observations
from a symmetric distribution $F(x - \mu)$, where $1 - F(x) = F(-x)$ for all x. We
also assume that $F(x) \ne 0$ or 1 for any finite x and furthermore that F( ) has
density f( ) with respect to Lebesgue measure.

We restrict attention to symmetric sampling situations in order to avoid the
issue of what “parameter” we try to estimate. Symmetry of the underlying
distribution allows us to define a target, namely $\mu$ = center of symmetry. Furthermore,
the target remains unambiguous if later on we allow for two, or many,
different sampling situations. The center of symmetry is well defined for all
symmetric shapes, which means the estimation of $\mu$ is a well defined problem for a
large class of sampling situations.

The solution Pitman gives is

(2.1)   $T_F(x_1,\dots,x_n) = -\dfrac{\int r \prod_{i=1}^{n} f(x_i + r)\,dr}{\int \prod_{i=1}^{n} f(x_i + r)\,dr}$

[see Pitman (1939)]. This estimator has the smallest mean-square-error among all
location equivariant estimators. Location equivariance is a reasonable restriction
on a location estimator T( ). It means that

(2.2)   $T(x_1 + r,\dots,x_n + r) = T(x_1,\dots,x_n) + r, \quad r \in \mathbb{R},$

i.e., the estimator changes in the same way as the sample.
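
As a numerical illustration of (2.1), the following sketch approximates the two integrals by the trapezoidal rule on a finite grid. The function name `pitman_estimate`, the grid half-width and the number of grid steps are assumptions made for this example only; they do not come from Pitman (1939). For the Gaussian density the result should reproduce the arithmetic mean, and shifting the sample by r should shift the estimate by r, as required by (2.2).

```python
import math

def log_gauss(x):
    """Log of the standard Gaussian density."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def pitman_estimate(xs, log_f, half_width=30.0, steps=6000):
    """Approximate T_F = -int r prod f(x_i + r) dr / int prod f(x_i + r) dr."""
    center = -sum(xs) / len(xs)          # grid centred at minus the sample mean
    h = 2.0 * half_width / steps
    rs = [center - half_width + j * h for j in range(steps + 1)]
    logs = [sum(log_f(x + r) for x in xs) for r in rs]
    top = max(logs)                      # subtract the maximum to avoid underflow
    num = 0.0
    den = 0.0
    for j, (r, lg) in enumerate(zip(rs, logs)):
        w = 0.5 if j in (0, steps) else 1.0   # trapezoidal end-point weights
        v = w * math.exp(lg - top)
        num += r * v
        den += v
    return -num / den

sample = [0.3, -1.2, 2.5, 0.9, 1.1]
t_gauss = pitman_estimate(sample, log_gauss)
t_shift = pitman_estimate([x + 2.0 for x in sample], log_gauss)
print(t_gauss, t_shift)
```

Working with log densities and subtracting the largest log value keeps the product of n densities from underflowing, which matters as soon as n is moderately large.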


REMARKS. (1) The most revealing way of deriving (2.1) is through the 
concept of “configurations.” By this notion, we mean the pattern of the points in 
the (ordered) sample, as specified, for example, by the gaps between the observa- 
tions. It is easily seen that this is an ancillary statistic. The Pitman estimator 
then is chosen such that conditioned on the configuration the estimate is 
unbiased. Since the conditional variance cannot be affected by the choice of the 
estimate (under equivariance), this has to produce the smallest mean-square- 
error. 

(2) The conditions on f( ) such that (2.1) exists are discussed in Pitman 
(1939). 

Formula (2.1) produces an estimator $T_F$ of the center of symmetry $\mu$ no matter
what the underlying sampling situation is. The $x_i$’s, therefore, need not be
sampled from $F(x - \mu)$.

Let us introduce $G(x - \mu)$, where again $G(x) = 1 - G(-x)$ for all x, as the
sampling situation for $x_1,\dots,x_n$. This is a new way of looking at the Pitman
estimator $T_F$ and it, of course, immediately lets us see the optimality property in
a new light. If, e.g., $F = \Phi$ and G = Cauchy, we are looking at the behavior of
the arithmetic mean under Cauchy sampling. If we are open minded about the
assumptions we base our inference on, we have to admit that in small samples we
cannot, with any reasonable precision, know what the underlying sampling
situation is nor should we attempt to make inferences about it. Huber (1964)
formalizes the idea of a robust method as a procedure which “behaves well” in
the neighborhood of a parametric model. Huber, therefore, would allow G( ) to
be chosen somewhere near F( ) and he modifies $T_F$ in such a way that the
behavior of the new estimate is acceptable for all allowed G( )s. This leads us
away from considering estimates like $T_F$ which are optimized at a single “point.”
Since, in small samples, we will never be able to tell at which “point” we are, it
ought to be obvious that single-point-optimization is a bad strategy.


2.2. Compromise estimators. Let us now consider the case where $x_1,\dots,x_n$ is
a sample from either $F_1(x - \mu)$ or $F_2(x - \mu)$, where $F_1$ and $F_2$ satisfy all the
constraints of F (see the beginning of Section 2.1). We are now interested in
location equivariant estimators which optimize at two “points,” namely $F_1$ and
$F_2$, simultaneously. This is obviously impossible. However, decision theory teaches
us that estimates of the form

(2.3)   $T_{F_1,F_2,\alpha}(x_1,\dots,x_n) = -\dfrac{\int r\{\alpha\prod_{i} f_1(x_i + r) + (1-\alpha)\prod_{i} f_2(x_i + r)\}\,dr}{\int\{\alpha\prod_{i} f_1(x_i + r) + (1-\alpha)\prod_{i} f_2(x_i + r)\}\,dr}$

($0 \le \alpha \le 1$) are bioptimal in the sense that they cannot be improved in both
sampling situations $F_1$ and $F_2$ simultaneously [see Ferguson (1967)].
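
The estimate (2.3) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: $F_1$ is taken to be Gaussian and $F_2$ Cauchy (a heavy-tailed choice made here only for the example), the integrals are replaced by trapezoidal sums on a grid, and the names `integrals` and `compromise_estimate` are hypothetical. The code uses the fact that the numerator of (2.3) is a mixture of the numerators of the two single-situation Pitman estimates, so the result is a data-dependent weighted mean of those two estimates.

```python
import math

def log_gauss(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_cauchy(x):
    return -math.log(math.pi * (1.0 + x * x))

def integrals(xs, log_f, half_width=60.0, steps=12000):
    """Return (log I, T): I = int prod f(x_i + r) dr and T the Pitman estimate (2.1)."""
    center = -sorted(xs)[len(xs) // 2]    # grid centred near minus the sample median
    h = 2.0 * half_width / steps
    rs = [center - half_width + j * h for j in range(steps + 1)]
    logs = [sum(log_f(x + r) for x in xs) for r in rs]
    top = max(logs)                       # guard against underflow
    num = den = 0.0
    for j, (r, lg) in enumerate(zip(rs, logs)):
        w = 0.5 if j in (0, steps) else 1.0
        v = w * math.exp(lg - top)
        num += r * v
        den += v
    return top + math.log(den * h), -num / den

def compromise_estimate(xs, alpha, log_f1=log_gauss, log_f2=log_cauchy):
    """Evaluate (2.3) as a weighted mean of the two Pitman estimates."""
    log_i1, t1 = integrals(xs, log_f1)
    log_i2, t2 = integrals(xs, log_f2)
    m = max(log_i1, log_i2)
    a1 = alpha * math.exp(log_i1 - m)
    a2 = (1.0 - alpha) * math.exp(log_i2 - m)
    weight1 = a1 / (a1 + a2)              # total weight placed on the F_1 estimate
    return weight1 * t1 + (1.0 - weight1) * t2, weight1, t1, t2

clean = [-0.8, -0.1, 0.2, 0.5, 1.1]
dirty = [-0.8, -0.1, 0.2, 0.5, 15.0]      # same sample with one gross outlier
est_clean, w_clean, t1_clean, t2_clean = compromise_estimate(clean, alpha=0.5)
est_dirty, w_dirty, t1_dirty, t2_dirty = compromise_estimate(dirty, alpha=0.5)
print(round(est_clean, 3), round(w_clean, 3))
print(round(est_dirty, 3), round(w_dirty, 3))
```

With the clean sample most of the weight goes to the Gaussian estimate (the mean); with the outlier the weight moves almost entirely to the Cauchy estimate, which is the adaptive behavior of the weights in a small sample.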


REMARKS. (1) We can also write

(2.4)   $T_{F_1,F_2,\alpha}(x_1,\dots,x_n) = \alpha\,w_{F_1}(x_1,\dots,x_n)\,T_{F_1}(x_1,\dots,x_n) + (1-\alpha)\,w_{F_2}(x_1,\dots,x_n)\,T_{F_2}(x_1,\dots,x_n),$

where

        $w_{F_k}(x_1,\dots,x_n) = \dfrac{\int \prod_{i} f_k(x_i + r)\,dr}{\int\{\alpha\prod_{i} f_1(x_i + r) + (1-\alpha)\prod_{i} f_2(x_i + r)\}\,dr}$

(k = 1, 2) and $T_{F_k}(\ )$ is defined in (2.1). We, therefore, can interpret the family of
bioptimal estimators as a weighted mean of the single-situation optimal estimators.
Note, however, that the weights are “adaptive”: they depend on the sample
values. Of course, any equivariant estimator can be represented as a weighted
mean of the single-situation optimal estimators. What matters here is the
simplicity and form of the weights together with their small sample optimality
property.

(2) It is clear from (2.3) that $T_{F_1,F_2,0} = T_{F_2}$ and $T_{F_1,F_2,1} = T_{F_1}$.

(3) The picture which helps us most in understanding the compromise estimators
is shown in Figure 2.1.

Note that since we only consider location equivariant estimators the risk in
any given situation does not depend on the parameter value $\mu$ [see Ferguson
(1967)]. The bioptimal or compromise estimators are the ones which lie on the
convex boundary curve.


FIG. 2.1. Plot of the mean-square-errors.


(4) A Bayesian interpretation of the estimator (2.3) is possible. In that
framework, $(\alpha, 1 - \alpha)$ is a prior distribution on the set of underlying sampling
shapes.

(5) In order to implement (2.3) in an actual application, formula (2.4) has some
interesting interpretations. Pregibon and Tukey (1981) derive the formulas from
the point of view of sampling in the case of unknown scale. This leads to the
consideration of different weights $w_{F_1}$ and $w_{F_2}$.

The choice of the two compromising distributions $F_1$ and $F_2$ is of importance
in applications of the technique. In many applications it is traditional to consider
$F_1 = \Phi$, the Gaussian shape. The choice of $F_2$ is somewhat related to the choice
of the contamination parameter $\varepsilon$ in Huber’s model. $F_2$ will influence two aspects
[see (2.4)]:

(i) the “relative weights” $w_{F_1}$ and $w_{F_2}$;

(ii) the “other” optimal estimator $T_{F_2}$.

These two aspects have an interpretation in the theory of M estimators. The first
is connected with the choice of tuning constants, as for example, k in Huber’s $\psi$
function ($\psi(x) = \max(-k, \min(k, x))$), and the second with the shape of the $\psi$
function. From small sample studies, we know for example that a redescending
$\psi$ function is advantageous: it costs little and buys a lot. If we want such a
behavior then $F_2$ has to be chosen as a heavy-tailed counterpart to the Gaussian,
for example, a distribution with Pareto tails.


3. The asymptotic behavior of compromise estimators. In this section
we are going to explore what happens to compromise estimators [see (2.3) or (2.4)]
if we sample from a distribution G( ) and let the sample size n grow. We will see
that the weights $w_{F_1}$ and $w_{F_2}$ usually tend to (0,1) or (1,0). A
compromise estimator for large sample sizes, therefore, will be close to either the
optimal estimate under $F_1$ or the optimal estimate under $F_2$. This is a reasonable
behavior since the “information” about the sampling situation G( ) grows as the
sample size gets large. The distinction between $F_1$ and $F_2$ is, therefore, more and
more estimable. In a few words then, we can say that compromise estimators
exhibit an adaptive behavior with the relative weights $w_{F_1}$ and $w_{F_2}$ [see (2.4)]
guiding the adaptation.


3.1. The asymptotic behavior of the relative weights. Suppose $x_1,\dots,x_n$ is a
sample of size n from $G(x - \mu)$. We assume that G( ) is symmetric around 0.
The relative weights are defined as

(3.1)   $w_{F_k}(x_1,\dots,x_n) = \dfrac{\int \prod_{i} f_k(x_i + r)\,dr}{\int\{\alpha\prod_{i} f_1(x_i + r) + (1-\alpha)\prod_{i} f_2(x_i + r)\}\,dr}$

(k = 1 or 2), where the notation is the same as in (2.4).


The following lemma treats an overly nice case; a model distribution with
Pareto tails, for example, is not covered.





LEMMA 3.1. Let us assume that both $-\log f_1$ and $-\log f_2$ are convex. We
also require that the first two derivatives exist and that the second derivatives be
bounded:

        $\dfrac{d}{dx}\log f_1(x), \quad \dfrac{d}{dx}\log f_2(x)$

exist and

        $\dfrac{d^2}{dx^2}\log f_1(x), \quad \dfrac{d^2}{dx^2}\log f_2(x)$

exist and are bounded from below.

Finally, with regard to the modeling distributions $F_1$ and $F_2$, we need

        $\dfrac{d^2}{dx^2}\log f_1(x) < 0$ and $\dfrac{d^2}{dx^2}\log f_2(x) < 0$ for $x \in [-\delta, \delta]$ for some $\delta > 0$.

Let us furthermore assume that G is such that the functions

        $A^1(r) = \int \log f_1(x + r)\,dG(x)$

and

        $A^2(r) = \int \log f_2(x + r)\,dG(x)$

exist for all r, achieve a unique maximum at r = 0, and are such that the second
derivative at r = 0 exists and is bounded. If

(3.2)   $\int \log f_1(x)\,dG(x) > \int \log f_2(x)\,dG(x)$

it follows that

        $\dfrac{w_{F_1}(x_1,\dots,x_n)}{w_{F_2}(x_1,\dots,x_n)} \to \infty$ a.s.
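PROOF. Let $X_1, X_2, \dots$ denote a sequence of i.i.d. random variables with
common distribution G. From (3.1) we have

        $\dfrac{w_{F_1}(X_1,\dots,X_n)}{w_{F_2}(X_1,\dots,X_n)} = \dfrac{\int \prod_{i} f_1(X_i + r)\,dr}{\int \prod_{i} f_2(X_i + r)\,dr}.$

Now

        $I_n(X_1,\dots,X_n) = \int \prod_{i=1}^{n} f(X_i + r)\,dr = \int \exp\Big\{\sum_{i=1}^{n}\log f(X_i + r)\Big\}\,dr = \int \exp(nA_n(r))\,dr,$

where

        $A_n(r) = \dfrac{1}{n}\sum_{i=1}^{n}\log f(X_i + r)$

and f stands for either $f_1$ or $f_2$.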
(1) Due to the strong law of large numbers we have

(3.3)   $A_n(r) \to A(r)$ a.s. for all r.

Convergence is uniform in r for $r \in [-\delta, \delta]$ because of our convexity assumptions.
For the same reason, the function $A_n(r)$ is maximized in some interval. Let
$R_0^n$ denote the center of that interval of maximal points. For n large enough the
maximum is unique since $-\log f$ is strictly convex in $[-\delta, \delta]$. It follows that
$R_0^n \to 0$ a.s.
(2) Let us define

        $I_{m,n}(X_1,\dots,X_n) = \int \exp(mA_n(r))\,dr.$

The integral $I_{m,n}$ allows an asymptotic expansion as $m \to \infty$. For large values of
m, we get

(3.4)   $I_{m,n}(X_1,\dots,X_n) \sim \int \exp\big(m\big[A_n(R_0^n) - \tfrac12|A_n''(R_0^n)|(r - R_0^n)^2\big]\big)\,dr \sim \exp(mA_n(R_0^n))\Big(\dfrac{2\pi}{m}\Big)^{1/2}\big(|A_n''(R_0^n)|\big)^{-1/2}.$

The theory of asymptotic expansions in this simple case is treated, for example,
in Chapter 4 of deBruijn (1981). Convergence of $I_{m,n}(X_1,\dots,X_n)$ takes place in
the sense of real functions. The values of $X_1,\dots,X_n$ do not matter. For large
enough n, the convergence in (3.4) is uniform in n. This follows from the error
formula in deBruijn (1981, page 64) and the bound on the second derivative of
$A_n(r)$ provided by the lower bound on the second derivative of log f. In short, the
argument goes as follows. The asymptotic expansion (3.4) works uniformly with
respect to n because the function $A_n(r)$ can be approximated by a quadratic near
its maximum uniformly in n.

(3) If we blend the probability structure which underlies the sequence
$X_1, X_2, \dots$ with the asymptotic approximation (3.4), we get

        $\dfrac{1}{n}\log I_n(X_1,\dots,X_n) \to A(0) = \int \log f(x)\,dG(x)$ a.s.

We therefore conclude from

        $\int \log f_1(x)\,dG(x) > \int \log f_2(x)\,dG(x)$

that

        $\dfrac{1}{n}\log I_n^1(X_1,\dots,X_n) - \dfrac{1}{n}\log I_n^2(X_1,\dots,X_n) = \dfrac{1}{n}\log\dfrac{w_{F_1}(X_1,\dots,X_n)}{w_{F_2}(X_1,\dots,X_n)} \to \text{constant} > 0$ a.s.,

where $I_n^1(X_1,\dots,X_n)$ and $I_n^2(X_1,\dots,X_n)$ refer to $f = f_1$ and $f = f_2$, respectively.
From this last statement the assertion of the lemma follows immediately.
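
The limit used in step (3) can be checked by simulation. The sketch below is only an illustration under assumptions chosen here: G and $F_1$ are standard Gaussian, $F_2$ is the standard logistic distribution (whose negative log density is convex with bounded second derivative), and the integrals are computed by the trapezoidal rule on a grid. The function names `log_integral` and `limit_difference` are hypothetical. The quantity $n^{-1}\log(w_{F_1}/w_{F_2})$ should approach the difference of the two sides of (3.2) as n grows.

```python
import math
import random

def log_gauss(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_logistic(x):
    # log of exp(-x) / (1 + exp(-x))^2, written in |x| for numerical stability
    a = abs(x)
    return -a - 2.0 * math.log1p(math.exp(-a))

def log_integral(xs, log_f, half_width=3.0, steps=600):
    """log of I_n = int prod f(x_i + r) dr, by the trapezoidal rule."""
    h = 2.0 * half_width / steps
    logs = [sum(log_f(x + (-half_width + j * h)) for x in xs)
            for j in range(steps + 1)]
    top = max(logs)                      # subtract the maximum to avoid underflow
    total = sum(math.exp(lg - top) for lg in logs)
    total -= 0.5 * (math.exp(logs[0] - top) + math.exp(logs[-1] - top))
    return top + math.log(total * h)

def limit_difference(half_width=10.0, steps=4000):
    """int {log f1(x) - log f2(x)} dG(x) for G standard Gaussian, cf. (3.2)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        x = -half_width + j * h
        w = 0.5 if j in (0, steps) else 1.0
        total += w * (log_gauss(x) - log_logistic(x)) * math.exp(log_gauss(x))
    return total * h

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(800)]
limit = limit_difference()
rates = {}
for n in (50, 200, 800):
    sub = xs[:n]
    rates[n] = (log_integral(sub, log_gauss) - log_integral(sub, log_logistic)) / n
    print(n, round(rates[n], 4), round(limit, 4))
```

Since the rate is bounded away from zero, the ratio of the weights itself grows exponentially fast in n, which is the content of the lemma.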


However, Lemma 3.1 is not strong enough for our purpose. If the underlying
distribution G does not have a first moment, for example, then $A^1(r)$ and $A^2(r)$
will be $-\infty$ for all r. Under stronger assumptions on $F_1$ and $F_2$, we can prove the
following.


LEMMA 3.2. Let G be such that its second moment is infinite. The assumptions
on $F_1$ and $F_2$ from Lemma 3.1 still hold. Furthermore we assume that the
ratio $d(x) = f_1(x)/f_2(x)$ is bounded and satisfies $(\log d(x))^{-1} = O(x^{-2})$ for
large |x|, i.e., log d(x) tends to $-\infty$ like a quadratic or faster.

It then follows that

        $\dfrac{w_{F_2}(X_1,\dots,X_n)}{w_{F_1}(X_1,\dots,X_n)} \to \infty$ a.s.
PROOF. Let $X_1, X_2, \dots$ denote a sequence of i.i.d. random variables with
common distribution G. Then we have

        $\dfrac{\int \prod_{i=1}^{n} f_1(X_i + r)\,dr}{\int \prod_{i=1}^{n} f_2(X_i + r)\,dr} = \dfrac{\int \prod_{i=1}^{n} d(X_i + r)\prod_{i=1}^{n} f_2(X_i + r)\,dr}{\int \prod_{i=1}^{n} f_2(X_i + r)\,dr} \le \max_r \prod_{i=1}^{n} d(X_i + r).$

The lemma is proved if we can show that $\max_r \prod_{i=1}^{n} d(X_i + r) \to 0$ a.s. But this is
a consequence of $\sum_{i=1}^{n}\log d(X_i + r) \to -\infty$ a.s., which follows from our assumptions
via the strong law of large numbers.


REMARKS. (1) The asymptotic expansion (3.4) shows how closely the maxi- 
mum likelihood estimator is connected to the Pitman estimator. Note that 
$|A''(0)|$ is equal to the Fisher information if G = F, i.e., the sampling distribution
and the modeling distribution are the same. We will see below that the maximum 
likelihood estimator is indeed asymptotically equivalent to the Pitman estimator. 

(2) It is reasonable to believe that Lemma 3.1 holds in greater generality. The 
convexity conditions on the —log densities are probably not needed and could be 
replaced by suitable assumptions on F, and F, close to the origin. 


COROLLARY 3.1. Under the assumptions of the Lemma 3.1 or Lemma 3.2, the 
compromise estimator Tp py, (a = 0) is asymptotically equivalent to the Pitman 
estimator Tp. 


Proor. Apply the lemmas to formula (2.4). 


REMARKS. (1) Corollary 3.1 states that with increasing sample size the
compromise estimator will pick either one of the two single-situation-optimal
estimates depending on (3.2).

We therefore expect that

(3.5)   $\int \log f_1(x)\,dG(x) - \int \log f_2(x)\,dG(x) = \int \log\dfrac{f_1(x)}{f_2(x)}\,dG(x)$

is a quantity which decides whether the sampling situation G is “closer” to the
modeling situation $F_1$ or the modeling situation $F_2$.

The quantity (3.5) is closely related to the Kullback-Leibler mean information
for discrimination [Kullback and Leibler (1951)]. Their formula is

        $I(1:2) = \int \log\dfrac{f_1(x)}{f_2(x)}\,dF_1(x),$

where I(1:2) is the mean information for discrimination per observation from
sampling situation $F_1$.

(2) The asymptotic behavior of the compromise estimators (2.3) does not
depend on $\alpha$ [unless (3.5) = 0].


(3) More results about Pitman estimators can be found in Johns (1979) and 
Klaassen (1981). Easton (1984) has proved the results given in Section 3.1 for the 
more general case of unknown scale. 
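
The role of the quantity (3.5) as a measure of “closeness” can be illustrated numerically. In the sketch below the modeling densities are assumed, for illustration only, to be standard Gaussian ($f_1$) and standard logistic ($f_2$); the function name `discrimination` and the integration grid are likewise choices made here. The integral is evaluated once with G equal to $F_1$ and once with G equal to $F_2$; the sign changes accordingly.

```python
import math

def log_gauss(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_logistic(x):
    a = abs(x)
    return -a - 2.0 * math.log1p(math.exp(-a))

def discrimination(log_f1, log_f2, log_g, half_width=40.0, steps=8000):
    """Trapezoidal approximation to int log{f1(x)/f2(x)} g(x) dx, cf. (3.5)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        x = -half_width + j * h
        w = 0.5 if j in (0, steps) else 1.0
        total += w * (log_f1(x) - log_f2(x)) * math.exp(log_g(x))
    return total * h

# G = F_1: the quantity is the Kullback-Leibler information I(1:2), hence positive.
d_gauss = discrimination(log_gauss, log_logistic, log_gauss)
# G = F_2: the quantity equals -I(2:1), hence negative.
d_logis = discrimination(log_gauss, log_logistic, log_logistic)
print(round(d_gauss, 4), round(d_logis, 4))
```

A positive value means that, asymptotically, the weights favor $T_{F_1}$; a negative value means they favor $T_{F_2}$.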


3.2. Asymptotics of the Pitman estimator. In order to get asymptotic
efficiencies for the compromise estimators we need to know more about the
asymptotic behavior of the Pitman estimators $T_{F_1}$ and $T_{F_2}$. Port and Stone (1974)
provide the information in the case where the sampling situation and the
modeling situation are identical. In our more general setup we can argue the
following way:

        $T_F(x_1,\dots,x_n) = -\dfrac{\int r\exp(nA_n(r))\,dr}{\int \exp(nA_n(r))\,dr},$

where

        $A_n(r) = \dfrac{1}{n}\sum_{i=1}^{n}\log f(x_i + r).$

If we expand the numerator asymptotically we get

        $T_F(x_1,\dots,x_n) \sim -\dfrac{\exp(nA_n(r_0^n))\int r\exp\{-\tfrac12 n|A_n''(r_0^n)|(r - r_0^n)^2\}\,dr}{\int \exp(nA_n(r))\,dr} \sim -r_0^n,$

where $r_0^n$ maximizes $A_n(r)$ (see deBruijn, page 66). We have therefore shown that
asymptotically the Pitman estimator and the maximum likelihood estimator
($-r_0^n$) agree. This agreement is good enough, namely $T_F(X_1,\dots,X_n) + r_0^n = o_p(n^{-1/2})$,
to conclude that the asymptotic distributions are the same. Huber
(1967) then provides the necessary results.
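
The agreement between the Pitman estimator and the maximum likelihood estimator can be seen in a small simulation. The sketch below is an illustration under assumptions made here: sampling and modeling distribution are both logistic, the sample size is 200, and the true center 1.5 is arbitrary. The names `mle_location` and `pitman_location` are hypothetical, and the integrals in (2.1) are replaced by sums over a fine grid.

```python
import math
import random

def log_logistic(x):
    a = abs(x)
    return -a - 2.0 * math.log1p(math.exp(-a))

def mle_location(xs, tol=1e-10):
    """Root of the logistic score sum tanh((x_i - t)/2) = 0, found by bisection."""
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        score = sum(math.tanh(0.5 * (x - mid)) for x in xs)
        if score > 0.0:                  # score decreases in t; root lies above mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pitman_location(xs, center, half_width=3.0, steps=3000):
    """Pitman estimate (2.1) on a grid of shifts r centred at -center."""
    h = 2.0 * half_width / steps
    rs = [-center - half_width + j * h for j in range(steps + 1)]
    logs = [sum(log_logistic(x + r) for x in xs) for r in rs]
    top = max(logs)
    num = den = 0.0
    for r, lg in zip(rs, logs):          # end points carry negligible mass here
        v = math.exp(lg - top)
        num += r * v
        den += v
    return -num / den

random.seed(7)
true_mu = 1.5
xs = []
for _ in range(200):
    u = random.random()
    while u == 0.0:                      # avoid log(0) in the inverse c.d.f.
        u = random.random()
    xs.append(true_mu + math.log(u / (1.0 - u)))

mle = mle_location(xs)
pit = pitman_location(xs, center=mle)
print(round(mle, 4), round(pit, 4), round(abs(pit - mle) * math.sqrt(len(xs)), 4))
```

The last printed number is the difference of the two estimates on the $n^{1/2}$ scale, which is the scale relevant for comparing asymptotic distributions.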


3.3. Huber’s contamination model: An example. To illustrate the use of the
theory we developed, let us look at the compromise estimators based on the two
modeling densities

        $f_1(x) = \varphi(x) = (2\pi)^{-1/2}\exp(-x^2/2),$
        $f_2(x) = (1-\varepsilon)\varphi(x)$ if $|x| \le k$,
        $f_2(x) = (1-\varepsilon)(2\pi)^{-1/2}\exp(k^2/2 - k|x|)$ otherwise,

where k is such that $(2\varphi(k)/k) - 2\Phi(-k) = \varepsilon/(1-\varepsilon)$. The alternative density
is, of course, the least favorable choice in the class of distributions
$\{(1-\varepsilon)\Phi(\ ) + \varepsilon H(\ )$: H( ) symmetric$\}$ [see Huber (1964)].
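
The constant k is defined only implicitly by $\varepsilon$. A minimal sketch of how it can be computed is given below; bisection is one convenient choice (assumed here, not prescribed by the text), and the function names `huber_k` and `log_f2` are hypothetical. The left side of the defining equation is strictly decreasing in k, so the root is unique.

```python
import math

def phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard Gaussian distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def huber_k(eps, lo=1e-6, hi=10.0, tol=1e-12):
    """Solve 2*phi(k)/k - 2*Phi(-k) = eps/(1 - eps) for k by bisection."""
    target = eps / (1.0 - eps)

    def g(k):
        return 2.0 * phi(k) / k - 2.0 * Phi(-k) - target

    # g is strictly decreasing in k (its derivative is -2*phi(k)/k**2)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def log_f2(x, eps, k):
    """Log of the least favorable density: Gaussian centre, exponential tails."""
    base = math.log(1.0 - eps) - 0.5 * math.log(2.0 * math.pi)
    if abs(x) <= k:
        return base - 0.5 * x * x
    return base + 0.5 * k * k - k * abs(x)

for eps in (0.01, 0.05, 0.1, 0.5):
    print(eps, round(huber_k(eps), 4))
```

The defining equation is exactly the condition that $f_2$ integrates to one: the central part contributes $(1-\varepsilon)(1 - 2\Phi(-k))$ and the two tails contribute $(1-\varepsilon)\,2\varphi(k)/k$.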

The asymptotic variance of an estimator compromising between these two
symmetric situations [see (2.3)] will be equal to either of the asymptotic variances
of the Pitman estimators,

        $T_{F_1}$ = arithmetic mean

or

        $T_{F_2}$ = Pitman estimator for the least favorable distribution.

If we sample from distribution G( ), we have for these asymptotic variances
($\mu_G = \int x\,dG(x)$)

        as. var$_G(T_{F_1}) = \int (x - \mu_G)^2\,dG(x),$

        as. var$_G(T_{F_2}) = \dfrac{\int \psi_k(x - \mu_G)^2\,dG(x)}{\big(\int \psi_k'(x - \mu_G)\,dG(x)\big)^2},$

where

        $\psi_k(x) = -f_2'(x)/f_2(x) = \max(-k, \min(k, x)).$

In his 1964 paper, Huber shows that the M-estimator based on $\psi_k(\ )$ is
asymptotically minimax for sampling situations chosen from the contamination
class. Since $T_{F_2}$ has the same asymptotic behavior as this M-estimator, the same
claim can be made for $T_{F_2}$. The following proposition explains the asymptotic
behavior of the compromise estimator [see (2.3)]. In order to be able to prove it,
the contamination class needs to be reduced a bit.


PROPOSITION 3.1. Let $G(x) = (1-\varepsilon)\Phi(x) + \varepsilon H(x)$, where $H(x) + H(-x) = 1$
for all x and H( ) puts all its mass outside the interval [−k, k], but is
otherwise arbitrary. Furthermore, assume that $0 < \varepsilon < 0.5$. Then

        as. var$_G$(compromise estimator) $\le$ as. var$_G$(Huber’s minimax estimator).
PROOF. From Lemma 3.1 and Lemma 3.2, we know that

        $\int \log f_1(x)\,dG(x) - \int \log f_2(x)\,dG(x)$

        $= -\log(1-\varepsilon) + \int_{|x|>k}\big\{-\tfrac12 x^2 + k|x| - \tfrac12 k^2\big\}\,dG(x)$

(3.6)   $= -\log(1-\varepsilon) + 2\int_k^\infty \big\{-\tfrac12 x^2 + kx - \tfrac12 k^2\big\}\,dG(x)$

is the quantity which decides about the asymptotic variance of the compromise
estimator. Note that we made use of the symmetry of the sampling distribution
G in the derivation of (3.6). If (3.6) is positive, the compromise estimators will
behave asymptotically like the arithmetic mean, otherwise like the Huber estimator.
All that remains to be considered, therefore, is the case where (3.6) is positive
(or zero) because in the other case the assertion of the proposition is trivial.
First, note that (3.6) can only be positive if G has finite variance. Using our
assumptions about $G = (1-\varepsilon)\Phi + \varepsilon H$ stated in the proposition, (3.6) can be
written as

2 2 


— log(1 — e) + 2(1 — of (a e E = = jel) dx 


+ef 
(3.7) = ~log(1 — e) — (1 — e) f(x - k}? d(x) 
=ef e- k)? dH(x) 


> —log(1 — e) — (1 — e)|-ko(k) + P(—k)(1 + k?)] 
E E 
-gP g 
where $\sigma_H^2 = \int x^2\,dH(x)$ is the variance of the contaminating distribution.
A comparison of the asymptotic variances of the sample mean and Huber’s estimator is not hard. We have


2 2 


k 
hie|- > =| dH(x) 








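$$\text{as. var}_G(\text{sample mean}) = (1 - \varepsilon) + \varepsilon\sigma_H^2, \tag{3.8}$$

$$\text{as. var}_G(\text{Huber estimator}) = \frac{\int \psi_k(x)^2\,dG(x)}{\left(\int \psi_k'(x)\,dG(x)\right)^2} = \frac{\int_{-k}^{k} x^2\,dG(x) + 2k^2 G(-k)}{\left(\int \psi_k'(x)\,dG(x)\right)^2} = \frac{(1 - \varepsilon)\int_{-k}^{k} x^2\,d\Phi(x) + k^2\varepsilon + 2k^2(1 - \varepsilon)\Phi(-k)}{(1 - \varepsilon)^2(\Phi(k) - \Phi(-k))^2}. \tag{3.9}$$

In this last formula we have again used all our knowledge about the sampling situation $G(\cdot)$.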


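What remains to be shown is

nonnegativeness in (3.7) $\Rightarrow$ (3.8) $\le$ (3.9).

But

$$(3.7) \ge 0 \;\Rightarrow\; \varepsilon\sigma_H^2 \le -2\log(1 - \varepsilon) + 2(1 - \varepsilon)k\varphi(k) - 2(1 - \varepsilon)\Phi(-k)(1 + k^2) - \varepsilon k^2$$

and, therefore, we have

$$(3.8) = (1 - \varepsilon) + \varepsilon\sigma_H^2 \le (1 - \varepsilon) + 2\log\left(\frac{1}{1 - \varepsilon}\right) + 2(1 - \varepsilon)k\varphi(k) - 2(1 - \varepsilon)\Phi(-k)(1 + k^2) - \varepsilon k^2.$$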
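Using the equation linking $\varepsilon$ and $k$,

$$\frac{2\varphi(k)}{k} - 2\Phi(-k) = \frac{\varepsilon}{1 - \varepsilon},$$

we can simplify and get

$$(3.8) \le 2\log\left(\frac{1}{1 - \varepsilon}\right) + (1 - \varepsilon)k^2\,\frac{\varepsilon}{1 - \varepsilon} + (1 - \varepsilon) - 2(1 - \varepsilon)\Phi(-k) - \varepsilon k^2,$$

$$(3.8) \le 2\log\left(\frac{1}{1 - \varepsilon}\right) + (1 - \varepsilon)(1 - 2\Phi(-k)).$$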


Along the same line of thought, we can simplify (3.9) to get

$$(3.9) = \frac{1}{(1 - \varepsilon)(1 - 2\Phi(-k))}.$$

Putting all these results together, we finally have











(3.9) >1+(1-e) ah 
> (1-e)(1-28(-k)) + (1 - ya) 
> (1 — e)(1 — 20(—2)) + oe = :} > (3.8) 


if only we show that 


46(k) i 1 yẹ 
k > loe{ —] 


holds. This last inequality is only true for $\varepsilon$ small enough, e.g., $\varepsilon < 0.5$. For such $\varepsilon$
values we have


(3.10) (1 -.) 








1 2 
og < 3e, 0<2< 05, 
l—e 


and (3.10) is therefore proved if we show that 
49(k) 
1 = 
a-92 





> 3e 


4o(k) e $(k) 
= 2—— — a 
ap ee E 
© 2kð(—k) > 20() 
$$\Leftrightarrow\; 3k\Phi(-k) \ge \varphi(k) \quad \text{for } k \in [0.436, \infty). \tag{3.11}$$

This last inequality (3.11), which is equivalent to (3.10), does indeed hold and is left for the reader to check.

Proposition 3.1 is now proved for all the cases where (3.6) is strictly positive. Some care is needed if (3.6) is zero. Then the compromise estimator is asymptotically a convex linear combination of $T_{\Phi}$ and $T_{F_k}$, but since the asymptotic variance of $T_{\Phi}$ is lower than the asymptotic variance of $T_{F_k}$, the compromise estimator will have an asymptotic variance below the asymptotic variance of $T_{F_k}$.
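
The inequality $3k\Phi(-k) \ge \varphi(k)$ on $[0.436, \infty)$, which the proof leaves to the reader, can be checked numerically. The sketch below is only an illustration: it assumes the usual Huber relation $2\varphi(k)/k - 2\Phi(-k) = \varepsilon/(1 - \varepsilon)$ between $\varepsilon$ and $k$, and the function names and the grid are choices made here, not part of the paper. It solves that relation for $\varepsilon = 0.5$ by bisection and tests the inequality on a fine grid up to $k = 10$; a grid check is evidence, not a proof.

```python
import math

def phi(k):
    """Standard normal density."""
    return math.exp(-0.5 * k * k) / math.sqrt(2.0 * math.pi)

def Phi_neg(k):
    """Phi(-k), the standard normal lower tail, via erfc for tail accuracy."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))

def epsilon_of_k(k):
    """Solve 2*phi(k)/k - 2*Phi(-k) = eps/(1 - eps) for eps, given k > 0."""
    t = 2.0 * phi(k) / k - 2.0 * Phi_neg(k)
    return t / (1.0 + t)

def k_of_epsilon(eps, lo=1e-6, hi=10.0, iters=200):
    """Invert epsilon_of_k by bisection (epsilon_of_k is decreasing in k)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if epsilon_of_k(mid) > eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Corner k that corresponds to eps = 0.5 under the assumed relation.
k_half = k_of_epsilon(0.5)

# Grid check of 3*k*Phi(-k) >= phi(k) from k = 0.436 up to about k = 10.
grid = [0.436 + 0.001 * i for i in range(9565)]
holds = all(3.0 * k * Phi_neg(k) >= phi(k) for k in grid)
```

Under the assumed relation, the lower endpoint 0.436 is the value of $k$ that corresponds to $\varepsilon = 0.5$, which is why the restriction $\varepsilon < 0.5$ translates into $k > 0.436$.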


REMARKS. (1) We have identified a class of sampling situations G, namely 
those where (3.6) is positive, for which the mean is a more efficient estimator than 
Huber’s minimax estimator. It would be of interest to show how big this class is 
and also to check whether it contains all sampling situations for which the 
sample mean is asymptotically better than Huber’s minimax estimator. 


4. Discussion. This paper deals with estimators which compromise between
different “shapes.” This idea, as we have seen, produces robust estimators. If we 
compromise between the Gaussian and Huber’s least favorable distribution, we 
have a family of estimators (for different values of a) which dominate Huber’s 
minimax M-estimator asymptotically. 

Several points need to be clarified, however. The idea of compromising is different from the usual asymptotic robustness theory as developed by Huber (1964, 1981). There, the compromising takes place in a neighborhood of the “central” model, whereas in our approach the different shapes need not be close together. A neighborhood model is in fact only a first step toward robust/resistant techniques for small sample sizes. For samples of size 5, we would advise compromising between the Gaussian and something like the slash (the distribution of the ratio of a Gaussian to an independent uniform) rather than using the only moderately tailed least favorable distribution.

The intention of this paper is not to show that we should use a compromise 
between the Gaussian and the least favorable distribution, but rather to let 
people know of the merits of compromise estimators in a language which many 
statisticians are used to, namely asymptotics. 

Results found through small sample experiments are of greater importance. It 
is clear, for example, that the situations (or shapes) we compromise ought to 
change with the sample size. The amount of “information” in the sample grows 
with the sample size. Not only are we able to estimate “parameters” with less 
variability, we also gain insight into the underlying shape. Compromise estima- 
tors use this knowledge in an optimal way and with our choice of the shapes we 
can fine-tune the procedure. Important choices have to be made in that respect 
and more (probably experimental) research for small sample sizes is needed. 
Subject-matter knowledge might prove useful in this connection. 

The extension of Pitman’s ideas to more than one shape provides us with a tool to find meaningful small sample methods of the robust/resistant kind. In order to make the asymptotics simple, we did not deal with the scale parameter.
In actual applications, the inclusion of this additional parameter is, however, no 
problem [see Bell and Morgenthaler (1981) for an example]. 


Acknowledgment. Many thanks to the referees for a careful reading of the 
manuscript, and for pointing out several mistakes. 
MINIMAX MULTIPLE SHRINKAGE ESTIMATION


By EDWARD I. GEORGE 
University of Chicago 


For the canonical problem of estimating a multivariate normal mean 
under squared-error-loss, this article addresses the problem of selecting a 
minimax shrinkage estimator when vague or conflicting prior information 
suggests that more than one estimator from a broad class might be effective. 
For this situation a new class of alternative estimators, called multiple 
shrinkage estimators, is proposed. These estimators use the data to emulate 
the behavior and risk properties of the most effective estimator under consideration.
Unbiased estimates of risk and sufficient conditions for minimaxity
are provided. Bayesian motivations link this construction to posterior means 
of mixture priors. To illustrate the theory, minimax multiple shrinkage Stein 
estimators are constructed which can adaptively shrink the data towards any 
number of points or subspaces. 




1. Introduction. Consider the following canonical setup. From $p$ independent experiments, we observe $Y = (Y_1, \ldots, Y_p)'$, which has the $p$-dimensional multivariate normal distribution

$$Y \mid \theta \sim N_p(\theta, I), \tag{1.1}$$

with unknown mean $\theta = (\theta_1, \ldots, \theta_p)'$ and the identity covariance matrix $I$. The problem is to find estimators $\delta = \delta(Y)\colon R^p \to R^p$ of $\theta$ which yield small risk or expected squared-error loss

$$R(\theta, \delta) = E_\theta(\delta - \theta)'(\delta - \theta) = E_\theta\|\delta - \theta\|^2, \tag{1.2}$$

where $E_\theta$ stands for averaging over the sample space with respect to the distribution (1.1) for fixed $\theta$.

Beginning with the seminal work of Stein (1955) and James and Stein (1960), interest has focused on the use of minimax shrinkage estimators for this problem [see Berger (1983)]. Each of these estimators not only dominates the maximum likelihood estimator $\delta^{MLE}(Y) = Y$, but also yields substantially smaller risk in a certain region of the parameter space. By selecting an estimator for which $\theta$ happens to be close to its corresponding region of improvement, meaningful risk gains can be achieved in practice. However, because $\theta$ is unknown and an estimator must be selected before looking at the data, the selection of an estimator or equivalently the region of improvement is typically based on available prior information. As a result of this feature, a large number of minimax shrinkage estimators have been developed, offering a wide variety of regions of risk improvement corresponding to different types of prior information [see Berger (1982) and Berger and Berliner (1984) for general discussions and references].

In this paper, we consider the general situation where conflicting or vague 
prior information suggests that more than one of a broad class of minimax 
shrinkage estimators may be effective. For this situation we present new minimax 
multiple shrinkage estimators which can incorporate this partial prior informa- 
tion by using the data to emulate the behavior and risk of the most effective 
estimators under consideration. These multiple shrinkage estimators enhance the 
practical potential of currently employed minimax shrinkage estimators by vastly 
broadening the region of the parameter space where meaningful risk reduction is 
available. 

For example, suppose attention was restricted to using a Stein estimator of the form

$$\delta^S(Y) = Y - \left[1 \wedge \frac{p - 2}{\|Y - \nu\|^2}\right](Y - \nu) \tag{1.3}$$

[$a \wedge b = \min(a, b)$], which shrinks $Y$ towards a target $\nu \in R^p$. (When $\nu = 0$, $\delta^S$ is the original positive-part Stein estimator which shrinks $Y$ towards 0.) As is well known, when $\theta$ happens to be in a small neighborhood surrounding $\nu$, $\delta^S$ yields very small risk, and when $\theta$ is far from this neighborhood, $\delta^S$ is essentially indistinguishable from $\delta^{MLE}$. Typically, $\nu$ would be a prior guess as to the location of $\theta$, perhaps the result of a previous experiment.

However, suppose prior information suggested several different choices for the target $\nu$. Denoting the corresponding choices for $\delta^S$ by $\delta_1^S, \ldots, \delta_K^S$, use of a single $\delta_k^S$ would potentially forgo important risk gains, especially if some of the target choices were far from each other. To avoid this limitation, we propose a multiple shrinkage Stein estimator for this situation. This estimator, which is described in greater generality in Section 3, is here of the form

$$\delta_*^S(Y) = \sum_{k=1}^{K} \rho_k^S(Y)\,\delta_k^S(Y), \tag{1.4}$$

where $\rho_1^S, \ldots, \rho_K^S$ satisfy $\sum_{k=1}^{K} \rho_k^S(Y) = 1$ and are adaptive functions of $Y$ which place increasing weight on the $\delta_k^S$ which are shrinking most. Thus, $\delta_*^S$ is an adaptive convex combination of the $\delta_k^S$ which provides more shrinkage when $Y$ is close to any of the targets. Unbiased estimates of risk and simulation results, also provided in Section 3, suggest that $\delta_*^S$ can offer meaningful risk reduction at each target. Moreover, it is shown that $\delta_*^S$ is minimax, and so possesses the same robustness quality as each $\delta_k^S$ with respect to misspecification of the targets.

In Section 2 general results on the construction, risk assessment, and Bayesian motivation of multiple shrinkage estimators are provided for the situation where a finite number of a broad class of minimax estimators are being contemplated. In Section 3 minimax multiple shrinkage Stein estimators are proposed and analyzed. In Section 4 the construction and assessment of multiple shrinkage estimators is indicated for the situation where a possibly infinite set of estimators is under consideration. In Section 5 it is shown that the main results of this paper generalize easily for the more realistic situation where $Y \mid \theta, \sigma^2 \sim N_p(\theta, \sigma^2 I)$ with an available independent estimate of $\sigma^2$.


2. Multiple shrinkage estimators. The following definitions are required. A function $m\colon R^p \to R$ is said to be almost differentiable (a.d.) if there exists a function $\nabla m\colon R^p \to R^p$ such that for all $z \in R^p$,

$$m(y + z) - m(y) = \int_0^1 z'\,\nabla m(y + tz)\,dt$$

for almost all $y \in R^p$. This definition implicitly defines $\nabla$ to be the vector differential operator

$$\nabla = (\nabla_1, \ldots, \nabla_p)', \quad \text{where } \nabla_i = \partial/\partial y_i.$$

(Essentially an a.d. function is continuous and a.e. differentiable.) The function $\nabla m$ is said to be a.d. if each coordinate function $\nabla_i m$ is a.d. When both $m$ and $\nabla m$ are a.d., $m$ is superharmonic if for almost all $y \in R^p$,

$$\nabla^2 m(y) = \sum_{i=1}^{p} \nabla_i^2 m(y) \le 0.$$

See Helms (1975) for an introduction to more general superharmonic functions.


2.1. Constructing multiple shrinkage estimators. Throughout this section, we consider the general situation where vague or conflicting prior information suggests that small risk may be obtainable by any one of $K$ shrinkage estimators of the form

$$\delta_k(Y) = Y + \nabla \log m_k(Y), \quad k = 1, \ldots, K, \tag{2.1}$$

where $m_k\colon R^p \to R^+ \cap \{0\}^c$ is such that $m_k$ and $\nabla m_k$ are a.d. For each estimator $\delta_k$, the function $m_k$ determines the shrinkage component, $\nabla \log m_k(Y)$. The class of estimators of the form (2.1) includes all Bayes, formal Bayes, and admissible rules [see Brown (1971)], and some reasonable inadmissible rules such as the Stein estimator $\delta^S$ in (1.3) (see Section 3).

When the regions where each of $\delta_1, \ldots, \delta_K$ offer especially small risk are very different, it may be preferable to consider using a multiple shrinkage estimator $\delta_*$, which we define to be

$$\delta_*(Y) = Y + \nabla \log m_*(Y), \qquad m_*(Y) = \sum_{k=1}^{K} w_k m_k(Y), \tag{2.2}$$

where $m_1, \ldots, m_K$ are the functions corresponding to $\delta_1, \ldots, \delta_K$ in (2.1), and

$$w_1, \ldots, w_K, \qquad \sum_{k=1}^{K} w_k = 1, \tag{2.3}$$

are a fixed set of prespecified positive weights (scaled as probabilities for convenience), which we shall refer to as prior weights. In Section 2.3 it is shown that when $\delta_1, \ldots, \delta_K$ are Bayes rules, $\delta_*$ is the Bayes rule for a mixture prior, and the prior weights arise naturally as prior probabilities.
The following reexpressions of $\delta_*$ illustrate the relationship between the behavior of $\delta_*$ and $\delta_1, \ldots, \delta_K$, suggesting the description of $\delta_*$ as a multiple shrinkage estimator,

$$\delta_*(Y) = Y + \sum_{k=1}^{K} \rho_k(Y)\,\nabla \log m_k(Y) = \sum_{k=1}^{K} \rho_k(Y)\,\delta_k(Y), \tag{2.4}$$

where

$$\rho_k(Y) = w_k m_k(Y)/m_*(Y). \tag{2.5}$$

Since $\sum_{k=1}^{K} \rho_k(Y) = 1$, the middle expression in (2.4) reveals the shrinkage component of $\delta_*$ to be an adaptive convex combination of the shrinkage components of $\delta_1, \ldots, \delta_K$; the rightmost expression shows $\delta_*$ as an adaptive convex combination of the estimators $\delta_1, \ldots, \delta_K$. We shall refer to $\rho_1, \ldots, \rho_K$, which adaptively weight the shrinkage contribution of the combined estimators, as relevance functions, following the idea first introduced by Efron and Morris (1972, 1973b). Each relevance function $\rho_k$ adaptively updates the prior weight $w_k$ by the factor $m_k/m_*$. Because $\rho_1(Y), \ldots, \rho_K(Y)$ are proportional to the terms $w_1 m_1(Y), \ldots, w_K m_K(Y)$, the relevance functions put larger weight on those $\delta_k$ for which $w_k m_k(Y)$ is larger. For example, when $w_k m_k(Y) \gg w_j m_j(Y)$ for all $j \ne k$, $\rho_k(Y)$ will be close to 1, and $\delta_*(Y)$ will emulate $\delta_k(Y)$. Note that when $m_k(Y)$ and $\nabla \log m_k(Y)$ are large simultaneously, $\delta_*$ will incorporate more of the shrinkage of $\delta_1, \ldots, \delta_K$.
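
The equivalence between the definition of the multiple shrinkage estimator as $Y + \nabla \log m_*(Y)$ and its expression as a relevance-weighted combination of the component estimators can be illustrated numerically. The sketch below is an assumption-laden example, not part of the paper: it takes each $m_k$ to be a spherical normal density with a hypothetical center and variance, and all function names are invented for the illustration. It computes the relevance functions, forms the convex combination, and compares it with a finite-difference evaluation of the gradient form.

```python
import numpy as np

def gaussian_marginal(y, center, var):
    """Density of N_p(center, var * I) at y; an assumed stand-in for one m_k."""
    p = y.size
    return (2 * np.pi * var) ** (-p / 2) * np.exp(-np.sum((y - center) ** 2) / (2 * var))

def relevance(y, centers, variances, weights):
    """Relevance functions rho_k(Y) = w_k m_k(Y) / m_*(Y)."""
    terms = np.array([w * gaussian_marginal(y, c, v)
                      for c, v, w in zip(centers, variances, weights)])
    return terms / terms.sum()

def delta_k(y, center, var):
    """delta_k(Y) = Y + grad log m_k(Y); here grad log m_k = -(Y - center) / var."""
    return y - (y - center) / var

def delta_star(y, centers, variances, weights):
    """Convex-combination form: sum_k rho_k(Y) * delta_k(Y)."""
    rho = relevance(y, centers, variances, weights)
    return sum(r * delta_k(y, c, v) for r, c, v in zip(rho, centers, variances))

def delta_star_numeric(y, centers, variances, weights, h=1e-5):
    """Gradient form: Y + grad log m_*(Y), with the gradient by central differences."""
    def log_m_star(u):
        return np.log(sum(w * gaussian_marginal(u, c, v)
                          for c, v, w in zip(centers, variances, weights)))
    grad = np.zeros_like(y)
    for i in range(y.size):
        e = np.zeros_like(y)
        e[i] = h
        grad[i] = (log_m_star(y + e) - log_m_star(y - e)) / (2 * h)
    return y + grad

# Two hypothetical targets; Y lies near the second one.
centers = [np.zeros(3), np.full(3, 4.0)]
variances = [2.0, 2.0]
weights = [0.5, 0.5]
y = np.array([3.5, 4.2, 3.9])
rho = relevance(y, centers, variances, weights)
combo = delta_star(y, centers, variances, weights)
numeric = delta_star_numeric(y, centers, variances, weights)
```

In this example nearly all of the relevance falls on the second component, so the combined estimator behaves almost exactly like the estimator that shrinks toward the nearby center.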


2.2. Some risk results for multiple shrinkage estimators. In this section we establish some general results which link the risk properties of $\delta_*$ with those of the combined estimators $\delta_1, \ldots, \delta_K$. Because $\delta_*$ and $\delta_1, \ldots, \delta_K$ are of the form $\delta(Y) = Y + \nabla \log m(Y)$, we make use of the following results of Stein (1973, 1981), which provide unbiased estimates of risk and sufficient minimaxity conditions for such estimators.

THEOREM 1 (Stein). Suppose $\delta(Y) = Y + \nabla \log m(Y)$ where $m\colon R^p \to R^+ \cap \{0\}^c$ is such that $m$ and $\nabla m$ are a.d. If

(i) $E_\theta|\nabla_i^2 m(Y)/m(Y)| < \infty$, $i = 1, \ldots, p$,

(ii) $E_\theta\|\nabla \log m(Y)\|^2 < \infty$,

then the risk of $\delta$ may be expressed as

$$R(\theta, \delta) = p - E_\theta D\delta(Y), \qquad D\delta(Y) = \|\nabla \log m(Y)\|^2 - 2\nabla^2 m(Y)/m(Y). \tag{2.6}$$

The expression $D\delta(Y)$ above is an unbiased estimate of the amount of risk reduction offered by $\delta$ over $\delta^{MLE}$ [$R(\theta, \delta^{MLE}) = p$]. $D\delta$ is used throughout to express unbiased estimates of risk reduction. Note that when $\theta$ is such that $D\delta(Y)$ is large with high probability, $\delta$ will yield especially small risk. Furthermore, because $D\delta(Y) \ge 0$ when $\nabla^2 m(Y) \le 0$, the following sufficient condition for the minimaxity of $\delta$ is immediate.
COROLLARY 1 (Stein). If $\delta(Y) = Y + \nabla \log m(Y)$ satisfies the conditions of Theorem 1 and $m$ is superharmonic, then $\delta$ is minimax.

Focusing now on the relationship between the risk properties of $\delta_*$ and $\delta_1, \ldots, \delta_K$, the following lemma shows when Theorem 1 and Corollary 1 may be applied to $\delta_*$.

LEMMA 1. If $\delta_1, \ldots, \delta_K$ satisfy the conditions of Theorem 1, then $\delta_*$ will satisfy the conditions of Theorem 1.


PROOF. It is immediate from (2.2) that $m_*\colon R^p \to R^+ \cap \{0\}^c$, and that $m_*$ and $\nabla m_*$ are a.d. Condition (i) follows by observing that

$$|\nabla_i^2 m_*(Y)/m_*(Y)| = \left|\sum_{k=1}^{K} \rho_k(Y)\,\nabla_i^2 m_k(Y)/m_k(Y)\right| \le \sum_{k=1}^{K} |\nabla_i^2 m_k(Y)/m_k(Y)|.$$

Condition (ii) follows from (2.4) and

$$\left\|\sum_{k=1}^{K} \rho_k(Y)\,\nabla \log m_k(Y)\right\|^2 \le \sum_{k=1}^{K} \rho_k(Y)\,\|\nabla \log m_k(Y)\|^2 \le \sum_{k=1}^{K} \|\nabla \log m_k(Y)\|^2. \qquad \Box$$

The next result provides an easily verifiable sufficient condition for the minimaxity of $\delta_*$, a condition that is somewhat stronger than the minimaxity of $\delta_1, \ldots, \delta_K$. Because of the potential complexity of the inputs for $\delta_*$, the protection against misspecification provided by minimaxity is an especially appealing property here.


COROLLARY 2. If $\delta_1, \ldots, \delta_K$ satisfy the conditions of Theorem 1 and if $m_1, \ldots, m_K$ are superharmonic, then $\delta_*$ is minimax.

PROOF. Because $m_* = \sum_{k=1}^{K} w_k m_k$ will be superharmonic whenever $m_1, \ldots, m_K$ are superharmonic, the result is immediate from Lemma 1 and Corollary 1. $\Box$

To offer any practical advantage over $\delta^{MLE}$, a minimax estimator must yield meaningful risk gains somewhere in the parameter space. The following result, which links the risk reduction estimate $D\delta_*$ to $D\delta_1, \ldots, D\delta_K$, suggests possible regions of improvement for $\delta_*$.


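COROLLARY 3. If $\delta_1, \ldots, \delta_K$ satisfy the conditions of Theorem 1, then

$$D\delta_*(Y) = \sum_{k=1}^{K} \rho_k(Y)\left[D\delta_k(Y) - \frac{1}{2}\sum_{l=1}^{K} \rho_l(Y)\,\|\delta_k(Y) - \delta_l(Y)\|^2\right]. \tag{2.7}$$

PROOF. By Lemma 1 and Theorem 1,

$$D\delta_*(Y) = \|\nabla \log m_*(Y)\|^2 - 2\nabla^2 m_*(Y)/m_*(Y) = \left\|\sum_{k=1}^{K} \rho_k(Y)\,\nabla \log m_k(Y)\right\|^2 - \sum_{k=1}^{K} \rho_k(Y)\,(2\nabla^2 m_k(Y)/m_k(Y)).$$

The desired result is obtained by substituting

$$\left\|\sum_{k=1}^{K} \rho_k(Y)\,\nabla \log m_k(Y)\right\|^2 = \sum_{k=1}^{K}\sum_{l=1}^{K} \rho_k(Y)\rho_l(Y)\,(\nabla \log m_k(Y))'(\nabla \log m_l(Y)) = \sum_{k=1}^{K} \rho_k(Y)\,\|\nabla \log m_k(Y)\|^2 - \frac{1}{2}\sum_{k=1}^{K}\sum_{l=1}^{K} \rho_k(Y)\rho_l(Y)\,\|\delta_k(Y) - \delta_l(Y)\|^2,$$

where the last equality follows from

$$\|\delta_k(Y) - \delta_l(Y)\|^2 = \|\nabla \log m_k(Y)\|^2 + \|\nabla \log m_l(Y)\|^2 - 2(\nabla \log m_k(Y))'(\nabla \log m_l(Y)). \qquad \Box$$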


Corollary 3 suggests when $\delta_*$ may offer meaningful risk gains in the same regions of the parameter space as any of $\delta_1, \ldots, \delta_K$. In (2.7) $D\delta_*$ is expressed as an adaptive convex combination of bracketed terms, each of which consists of the risk reduction estimate $D\delta_k$ penalized by a factor which weights the shrinkage conflict between $\delta_k$ and the other estimators. Note that when $\rho_k(Y) = 1$, $D\delta_*(Y) = D\delta_k(Y)$, since $\rho_l(Y) = 0$ for $l \ne k$. Thus, the size of $D\delta_*(Y)$ will be increased by sharply adaptive relevance functions which, for each $Y$, put most of their weight on the largest $D\delta_k(Y)$. Such behavior would yield $R(\theta, \delta_*) \approx \min_k R(\theta, \delta_k)$. Examples where this approximation is excellent are provided in Section 3.
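
The decomposition of the risk reduction estimate of the combined estimator into component estimates minus a conflict penalty is an algebraic identity, so it can be spot-checked numerically. The sketch below is illustrative only and rests on an assumption made here: each $m_k$ is a spherical normal density $N_p(c_k, v_k I)$ with hypothetical centers and variances, for which $\nabla \log m_k = -(Y - c_k)/v_k$ and $\nabla^2 m_k/m_k = \|\nabla \log m_k\|^2 - p/v_k$. The function name `check_corollary3` is invented for the example.

```python
import numpy as np

def check_corollary3(y, centers, variances, weights):
    """Return (direct, via_27): D delta_* from its definition and from the
    relevance-weighted decomposition with the pairwise conflict penalty."""
    p = y.size
    # Shrinkage components g_k = grad log m_k for the assumed normal m_k.
    g = np.array([-(y - c) / v for c, v in zip(centers, variances)])
    # Laplacian ratios: lap m_k / m_k = ||g_k||^2 + div g_k = ||g_k||^2 - p / v_k.
    lap = np.array([gk @ gk - p / v for gk, v in zip(g, variances)])
    # Relevance functions, computed on the log scale for numerical stability.
    log_terms = np.array([np.log(w) - 0.5 * p * np.log(2 * np.pi * v)
                          - np.sum((y - c) ** 2) / (2 * v)
                          for c, v, w in zip(centers, variances, weights)])
    rho = np.exp(log_terms - log_terms.max())
    rho /= rho.sum()
    # Direct evaluation: ||grad log m_*||^2 - 2 lap m_* / m_*.
    g_star = rho @ g
    direct = g_star @ g_star - 2 * (rho @ lap)
    # Decomposition: sum_k rho_k [D delta_k - 0.5 sum_l rho_l ||delta_k - delta_l||^2].
    d_k = np.array([gk @ gk for gk in g]) - 2 * lap
    deltas = y + g
    conflict = np.array([[np.sum((dk - dl) ** 2) for dl in deltas] for dk in deltas])
    via_27 = rho @ (d_k - 0.5 * (conflict @ rho))
    return direct, via_27

rng = np.random.default_rng(0)
y = rng.normal(size=5)
centers = [np.zeros(5), np.ones(5), np.full(5, -2.0)]
direct, via_27 = check_corollary3(y, centers, [2.0, 3.0, 1.5], [0.2, 0.5, 0.3])
```

The two values agree up to floating-point error for any choice of centers, variances, and weights, since only the algebra of the proof is being exercised.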


2.3. Bayesian motivations. In this section multiple shrinkage estimators are shown to arise naturally as Bayes rules under mixture priors in the Bayesian context. More precisely, suppose $\delta_1, \ldots, \delta_K$ are Bayes rules corresponding to the prior densities $\pi_1, \ldots, \pi_K$, respectively. Using the well-known representation [see, for example, Stein (1981)], each of these may be expressed as

$$\delta_k(Y) = E_{\pi_k}(\theta \mid Y) = Y + \nabla \log m(Y \mid \pi_k), \tag{2.8}$$

where

$$m(Y \mid \pi_k) = \int (2\pi)^{-p/2} e^{-\|Y - \theta\|^2/2}\,\pi_k(\theta)\,d\theta \tag{2.9}$$

is the marginal density of $Y$ under $\pi_k$. Replacing $m_k(Y)$ by $m(Y \mid \pi_k)$ and $m_*(Y)$ by $m(Y \mid \pi_*)$, $\delta_*$ in (2.2) becomes

$$\delta_*(Y) = Y + \nabla \log m(Y \mid \pi_*), \quad \text{where } m(Y \mid \pi_*) = \sum_{k=1}^{K} w_k\,m(Y \mid \pi_k). \tag{2.10}$$

Because $m(Y \mid \pi_*)$ is the marginal density of $Y$ under the mixture prior

$$\pi_*(\theta) = \sum_{k=1}^{K} w_k \pi_k(\theta), \tag{2.11}$$

it follows that $\delta_*(Y) = E_{\pi_*}(\theta \mid Y)$ is the Bayes rule under $\pi_*$. The assumption that $\theta$ has the mixture prior $\pi_*$, being equivalent to the assumption that $\theta$ has the prior $\pi_k$ with probability $w_k$, nicely expresses the vague or conflicting prior information that any of $\delta_1, \ldots, \delta_K$ may be effective. This method of combining prior information through mixtures can also be motivated in the multi-Bayesian context [see Kempthorne (1985)].

The Bayesian motivation also provides a natural interpretation of each relevance function in (2.5), which here is

$$\rho_k(Y) = w_k\,m(Y \mid \pi_k)/m(Y \mid \pi_*) = P(\pi_k \mid Y), \tag{2.12}$$

the updated posterior probability that $\theta$ has the prior density $\pi_k$. The alternative representation of $\delta_*$ in (2.4),

$$\delta_*(Y) = \sum_{k=1}^{K} \rho_k(Y)\,\delta_k(Y) = \sum_{k=1}^{K} P(\pi_k \mid Y)\,E_{\pi_k}(\theta \mid Y), \tag{2.13}$$

shows how the relevance functions here put increasing weight on the posterior mean $\delta_k(Y) = E_{\pi_k}(\theta \mid Y)$ which is supported by the data through $m(Y \mid \pi_k)$. Finite mixture distributions have been used to obtain robustness properties in the Bayesian context by Box and Tiao (1968), Abraham and Box (1978), and Zellner (1985).
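
As an illustration not taken from the text, suppose the component priors are conjugate normals, $\pi_k = N_p(\nu_k, \tau^2 I)$, where the centers $\nu_k$ and the common prior variance $\tau^2$ are symbols assumed here only for the example. Under that assumption the marginal density, the component Bayes rule, and the posterior probability of each prior take closed forms, sketched below.

```latex
\begin{align*}
m(Y \mid \pi_k) &= \{2\pi(1+\tau^2)\}^{-p/2}
   \exp\!\left\{-\frac{\|Y-\nu_k\|^2}{2(1+\tau^2)}\right\}, \\
\delta_k(Y) &= Y + \nabla\log m(Y \mid \pi_k)
   = Y - \frac{1}{1+\tau^2}\,(Y-\nu_k), \\
\rho_k(Y) &= \frac{w_k \exp\{-\|Y-\nu_k\|^2/(2(1+\tau^2))\}}
   {\sum_{j=1}^{K} w_j \exp\{-\|Y-\nu_j\|^2/(2(1+\tau^2))\}} .
\end{align*}
```

In this special case the posterior probabilities decay exponentially in the squared distance from $Y$ to each center, so the mixture Bayes rule shrinks mainly toward the nearest $\nu_k$ unless the prior weights are very unequal.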

Although these manipulations are carried through formally in Section 2.1, treating $m_1, \ldots, m_K$ in (2.1) as arbitrary functions, the Bayesian character of $\delta_*$ suggests that desirable properties may be obtained when these functions are at least approximations to marginal densities. However, one drawback is that when $m_1, \ldots, m_K$ are not marginal densities corresponding to bona fide priors, the weights $w_1, \ldots, w_K$ lose their interpretation as prior probabilities in the mixture prior $\pi_*$. Nonetheless, it may be useful even in non-Bayes examples of $\delta_*$ to consider calibrations of these weights which roughly reflect the statistician’s prior probability or degree of belief in the potential effectiveness of the estimators $\delta_1, \ldots, \delta_K$. Although the choice of prior weights ultimately corresponds to the choice of a risk function, such an interpretation may facilitate their specification in practice.


3. A multiple shrinkage Stein estimator. In this section, we consider the special case of $\delta_*$ in (2.2) obtained when $\delta_1, \ldots, \delta_K$ in (2.1) are general positive-part Stein estimators. Other examples of multiple shrinkage estimators have been considered by the author in George (1986a, 1986b, 1986c). The following notation will be used throughout. Let

$$V_1, \ldots, V_K$$

denote a set of (possibly affine) subspaces of $R^p$ such that $V_k$ has dimension $p - q_k$, where $q_k \ge 3$. For any $Y \in R^p$, let $P_kY$ denote the projection of $Y$ onto $V_k$ defined by $\|Y - P_kY\| = \min_{v \in V_k}\|Y - v\|$. For convenience, let

$$s_k(Y) = \|Y - P_kY\|^2$$

denote the squared distance from $Y$ to $V_k$.


3.1. Construction of a multiple shrinkage Stein estimator. As a more general version of the example described in Section 1, suppose vague or conflicting prior information suggested that small risk might be obtainable by using one of the following $K$ positive-part Stein estimators, $\delta_1^S, \ldots, \delta_K^S$, which shrink $Y$ towards the subspaces $V_1, \ldots, V_K$, respectively,

$$\delta_k^S(Y) = Y - \left[1 \wedge \frac{q_k - 2}{s_k(Y)}\right](Y - P_kY), \quad k = 1, \ldots, K, \tag{3.1}$$

where $a \wedge b = \min\{a, b\}$; see Sclove, Morris, and Radhakrishnan (1972). For example, the estimator $\delta^S$ in (1.3) is a special case of $\delta_k^S$ when $V_k = \nu \in R^p$, $q_k = p$, $P_kY = \nu$, and $s_k(Y) = \|Y - \nu\|^2$. Another common choice [see Lindley (1962) and Efron and Morris (1975)] is $V_k = [1_p]$, the subspace spanned by the vector $1_p = (1, \ldots, 1)'$, in which case $q_k = p - 1$, $P_kY = \bar{Y}1_p$, and $s_k(Y) = \|Y - \bar{Y}1_p\|^2$ where $\bar{Y} = \sum_{i=1}^{p} Y_i/p$.

Typically, the targets $V_1, \ldots, V_K$ would correspond here to several guesses for the approximate location of $\theta$. As distinct from the example in Section 1, this more general situation allows for overlapping targets; $V_1, \ldots, V_K$ might even be a sequence of nested subspaces. As is well known [and is illustrated by (3.9)], each $\delta_k^S$ yields meaningful risk reduction over $\delta^{MLE}$ only when $\theta$ is close to $V_k$, and this reduction is larger when $V_k$ has smaller dimension; indeed, when $\theta \in V_k$, $R(\theta, \delta_k^S)$ is slightly less than $p - q_k + 2$. Thus, when the prior information was correct that $\theta$ was close to one or more of $V_1, \ldots, V_K$, some of the estimators $\delta_1^S, \ldots, \delta_K^S$ could offer substantially smaller risk than others. Failure to choose a more effective $\delta_k^S$ would then result in forgoing large potential risk reduction.

To avoid the limitation of choosing a single Stein estimator for this situation, we construct a multiple shrinkage alternative. Generalizing the expression in Stein (1973) for the case $V_k = 0$, each of the estimators in (3.1) is of the form $\delta_k^S(Y) = Y + \nabla \log m_k^S(Y)$ as in (2.1), where

$$\nabla \log m_k^S(Y) = -\left[1 \wedge \frac{q_k - 2}{s_k(Y)}\right](Y - P_kY) \tag{3.2}$$

when

$$m_k^S(Y) = \begin{cases} \left(\dfrac{q_k - 2}{e\,s_k(Y)}\right)^{(q_k - 2)/2} & \text{if } s_k(Y) \ge (q_k - 2), \\[1ex] \exp\{-s_k(Y)/2\} & \text{if } s_k(Y) < (q_k - 2). \end{cases} \tag{3.3}$$
Applying the construction in Section 2.1 to $\delta_1^S, \ldots, \delta_K^S$ thus yields the multiple shrinkage Stein estimator

$$\delta_*^S(Y) = Y + \nabla \log m_*^S(Y), \quad \text{where } m_*^S(Y) = \sum_{k=1}^{K} w_k m_k^S(Y), \tag{3.4}$$

a special case of $\delta_*$ in (2.2) where $m_* = m_*^S$, $m_k = m_k^S$, and $w_1, \ldots, w_K$ are prior weights as in (2.3). Note that although each $m_k^S$ is determined by (3.2) only up to a proportionality constant, to facilitate comparisons we have scaled $m_1^S, \ldots, m_K^S$ in (3.3) to be equal when $s_1 = \cdots = s_K = 0$. It should be emphasized that $m_1^S, \ldots, m_K^S$ are not real marginal densities, so that $w_1, \ldots, w_K$ will not be real prior probabilities here. Nonetheless, it may be useful to regard each $m_k^S$ as an estimate of an unknown marginal (see Section 3.4). When $V_1, \ldots, V_K$ are equidimensional, so that $q_1 = \cdots = q_K$, it may be reasonable to treat $w_1, \ldots, w_K$ as prior probabilities; by symmetry considerations, the normalizing constants which would relate the $m_k^S$ to real marginals would then be the same. However, when $q_1, \ldots, q_K$ are unequal, the absence of an appropriate normalization of $m_1^S, \ldots, m_K^S$ makes any such interpretation more tenuous.

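As in (2.4) and (2.5), the following reexpressions show how $\delta_*^S$ is an adaptive convex combination of the estimators $\delta_1^S, \ldots, \delta_K^S$,

$$\delta_*^S(Y) = Y - \sum_{k=1}^{K} \rho_k^S(Y)\left[1 \wedge \frac{q_k - 2}{s_k(Y)}\right](Y - P_kY) = \sum_{k=1}^{K} \rho_k^S(Y)\,\delta_k^S(Y), \tag{3.5}$$

where

$$\rho_k^S(Y) = w_k m_k^S(Y)/m_*^S(Y). \tag{3.6}$$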


The behavior of $\delta_*^S$ is intuitively appealing. First of all, when $Y$ is far from all the targets, $\delta_*^S$ behaves essentially like $\delta^{MLE}$ since the shrinkage provided by each $\delta_k^S$ is trivial. To describe the behavior of $\delta_*^S$ as $Y$ approaches the targets, it is useful to begin with the special case of equidimensional targets, $q_1 = \cdots = q_K$, and uniform prior weights, $w_1 = \cdots = w_K$. In this case $w_1 m_1^S, \ldots, w_K m_K^S$ are identical decreasing functions of $s_1, \ldots, s_K$, so that $\rho_k^S > \rho_l^S$ iff $s_k < s_l$. Because $[1 \wedge (q_k - 2)/s_k]$ is also decreasing in $s_k$, $\delta_*^S$ puts more weight on those $\delta_k^S$ which are shrinking most. Effectively, $\delta_*^S$ shrinks $Y$ in the direction of the closer targets, and the magnitude of shrinkage increases with the proximity of $Y$ to these targets. Use of nonuniform prior weights proportionately changes the relative weighting of $\delta_1^S, \ldots, \delta_K^S$, changing the magnitude and direction of shrinkage by $\delta_*^S$ accordingly. However, because $\rho_1^S, \ldots, \rho_K^S$ are so sharply adaptive, especially when $q_1, \ldots, q_K$ are large, $\delta_*^S$ will essentially emulate $\delta_k^S$ when $Y$ is close to $V_k$ and no other target, as long as the prior weights are not too disparate.

In the general case where $q_1, \ldots, q_K$ are unequal, the functions $m_k^S$ for which $q_k$ is larger decrease more rapidly. Unless $w_k$ is chosen larger when $q_k$ is larger, $\delta_*^S$ may fail to exploit very much of the shrinkage potential of the $\delta_k^S$ corresponding to the lower dimensional targets. For example, when $q_k > q_l$, $\rho_k^S/\rho_l^S$ may drop off very quickly as $s_k$ increases, especially if the targets were nested, $V_k \subset V_l$. Setting $w_k = w_l$ might result in $\rho_k^S \ll \rho_l^S$ even when $(q_k - 2)/s_k > (q_l - 2)/s_l$ and $\delta_k^S$ is shrinking more than $\delta_l^S$. This behavior can be roughly avoided by using the calibration


$$w_k \propto (ce)^{q_k/2}, \quad k = 1, \ldots, K, \tag{3.7}$$


which for $c > 1$ forces $\rho_k^S > \rho_l^S$ when $(q_k - 2)/s_k > (q_l - 2)/s_l = 1/c$ and $q_k > q_l$. In the next section this calibration is seen to be reasonable from a risk perspective. Various choices of $c$ are briefly examined in one of the simulations in Section 3.3.
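
The construction of this section can be sketched in code for the simplest situation of point targets, where each $V_k$ is a single point $\nu_k$, so that $q_k = p$, $P_kY = \nu_k$, and $s_k(Y) = \|Y - \nu_k\|^2$. The sketch below is an illustration under those assumptions; the function names are invented here, and the piecewise form of $\log m_k^S$ used in `log_m_stein` is the scaling that equals 0 at $s_k = 0$ and has gradient equal to the positive-part Stein shrinkage.

```python
import numpy as np

def log_m_stein(s, q):
    """log m_k^S as a function of s = s_k(Y) and q = q_k (piecewise form)."""
    if s >= q - 2:
        return 0.5 * (q - 2) * (np.log(q - 2) - 1.0 - np.log(s))
    return -0.5 * s

def stein_point(y, nu):
    """Positive-part Stein estimator shrinking y toward the point target nu."""
    q = y.size                      # point target: q_k = p
    s = np.sum((y - nu) ** 2)       # squared distance to the target
    factor = 1.0 if s <= q - 2 else (q - 2) / s   # 1 ^ (q - 2)/s
    return y - factor * (y - nu)

def multiple_shrinkage_stein(y, targets, weights):
    """Relevance-weighted combination of the component Stein estimators.

    Returns the estimate and the relevance functions rho_k^S(Y)."""
    q = y.size
    log_terms = np.array([np.log(w) + log_m_stein(np.sum((y - nu) ** 2), q)
                          for nu, w in zip(targets, weights)])
    rho = np.exp(log_terms - log_terms.max())   # stabilized normalization
    rho /= rho.sum()
    estimate = sum(r * stein_point(y, nu) for r, nu in zip(rho, targets))
    return estimate, rho

# Two hypothetical point targets in R^10; Y lies very close to the second.
targets = [np.zeros(10), np.full(10, 2.0)]
y = np.full(10, 1.9)
est, rho = multiple_shrinkage_stein(y, targets, [0.5, 0.5])
```

With uniform weights and an observation this close to the second target, almost all relevance goes to the second component and the estimate is essentially that target, which is the emulation behavior described above. Subspace targets would additionally require the projections $P_kY$ and the dimension-dependent $q_k$, which this sketch does not attempt.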


3.2. The risk of a multiple shrinkage Stein estimator. The application of the results of Section 2.2 shows that $\delta_*^S$ inherits desirable risk properties from $\delta_1^S, \ldots, \delta_K^S$. To begin with, $\delta_*^S$ is minimax. This property follows from Corollary 2 and the superharmonicity of $m_1^S, \ldots, m_K^S$, which is demonstrated by

$$\nabla^2 m_k^S(Y)/m_k^S(Y) = \begin{cases} 0 & \text{if } s_k(Y) \ge (q_k - 2), \\ -(q_k - s_k(Y)) & \text{if } s_k(Y) < (q_k - 2). \end{cases} \tag{3.8}$$

Note that (3.8) and Corollary 1 provide an immediate verification of the well-known minimaxity of $\delta_k^S$.

The following unbiased estimates of risk reduction provide some insight as to the regions of the parameter space where $\delta_*^S$ may potentially offer meaningful risk reduction. Inserting (3.2) and (3.8) into (2.6) in Theorem 1 yields the risk reduction estimate for $\delta_k^S$,

$$D\delta_k^S(Y) = \begin{cases} (q_k - 2)^2/s_k(Y) & \text{if } s_k(Y) \ge (q_k - 2), \\ 2q_k - s_k(Y) & \text{if } s_k(Y) < (q_k - 2), \end{cases} \tag{3.9}$$

a slight generalization of the result in Stein (1973) for the case $V_k = 0$. By Corollary 3, the risk reduction estimate for $\delta_*^S$ may be expressed in terms of (3.9) as

$$D\delta_*^S(Y) = \sum_{k=1}^{K} \rho_k^S(Y)\left[D\delta_k^S(Y) - \frac{1}{2}\sum_{l=1}^{K} \rho_l^S(Y)\,\|\delta_k^S(Y) - \delta_l^S(Y)\|^2\right]. \tag{3.10}$$

We should point out that although $D\delta_k^S$ and $D\delta_*^S$ are useful for making risk comparisons, they are not always reasonable as estimates of risk. For example, $D\delta_k^S > p$, which occurs when $q_k = p$ and $s_k$ is small, leads to a negative risk estimate, which is silly.
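
The step from the shrinkage component of each Stein estimator to its Laplacian ratio and its risk reduction estimate is not written out in the text. A sketch of one way to verify it is given below, writing $s = s_k(Y)$, $q = q_k$, and $g = \nabla \log m_k^S(Y)$, and assuming the standard facts $\nabla s = 2(Y - P_kY)$, $\nabla \cdot (Y - P_kY) = q$, and $\nabla^2 m/m = \|g\|^2 + \nabla \cdot g$. The symbol $g$ is introduced here only for the sketch.

```latex
\begin{align*}
\text{If } s < q-2:\quad
 & g = -(Y-P_kY), \qquad \|g\|^2 = s, \qquad \nabla\cdot g = -q, \\
 & \nabla^2 m_k^S/m_k^S = s - q = -(q-s), \qquad
   D\delta_k^S = s - 2(s-q) = 2q - s. \\[1ex]
\text{If } s > q-2:\quad
 & g = -\frac{q-2}{s}\,(Y-P_kY), \qquad \|g\|^2 = \frac{(q-2)^2}{s}, \\
 & \nabla\cdot g = -(q-2)\left(\frac{q}{s} - \frac{2s}{s^2}\right)
   = -\frac{(q-2)^2}{s}, \\
 & \nabla^2 m_k^S/m_k^S = 0, \qquad D\delta_k^S = \frac{(q-2)^2}{s}.
\end{align*}
```

Both branches give a nonpositive Laplacian ratio (since $s < q - 2 < q$ in the first), which is the superharmonicity used for minimaxity, and the first branch makes clear why the risk reduction estimate can exceed $p$ when $q = p$ and $s$ is small.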

Comparison of (3.5) and (3.10) shows that $D\delta_*^S$ adaptively emulates the risk estimates $D\delta_1^S, \ldots, D\delta_K^S$ much in the same way that $\delta_*^S$ adaptively emulates the estimators $\delta_1^S, \ldots, \delta_K^S$. Consider first the equidimensional and uniformly weighted case where $\rho_k^S > \rho_l^S$ iff $s_k < s_l$. Because $D\delta_k^S$ is decreasing in $s_k$, $\rho_k^S$ and $D\delta_k^S$ will be large simultaneously. Since $\rho_1^S, \ldots, \rho_K^S$ are so sharply adaptive, $D\delta_*^S(Y) \approx \max_k D\delta_k^S(Y)$ whenever $Y$ is close to some $V_k$, suggesting that $R(\theta, \delta_*^S) \approx \min_k R(\theta, \delta_k^S)$ whenever $\theta$ is close to $V_1 \cup \cdots \cup V_K$. Of course, we do not believe (although we have not been able to prove it) that $\delta_*^S$ will dominate $\delta_k^S$ when $\theta \in V_k$; intuitively, when $\theta \in V_k$, $\delta_k^S$ will always shrink in the correct direction, whereas $\delta_*^S$ will not. Although $\delta_*^S$ may not provide quite as much risk reduction as the most effective $\delta_k^S$, the increased size of the region of improvement may be a very desirable trade-off. Indeed, the simulation results in the next section suggest that the approximation of $R(\theta, \delta_*^S)$ to $R(\theta, \delta_k^S)$ when $\theta$ is close to $V_k$ can be excellent. Note that by increasing $w_k$, one can improve this approximation, although it would be at the expense of less risk improvement near some of the other targets.

In the general case where $q_1, \ldots, q_K$ are unequal, the form of $D\delta_k^S$ suggests that uniform prior weights are less desirable because for larger $q_k$, $\rho_k^S$ may drop off very quickly as $s_k$ increases, especially when $V_k$ was nested in a higher dimensional subspace. Instead, it seems desirable to choose $w_1, \ldots, w_K$ so that $\rho_k^S(Y) \approx 1$ when $D\delta_k^S(Y) = \max_l D\delta_l^S(Y)$ and $s_k$ is small. Analogously to the equidimensional case, such behavior would yield $D\delta_*^S(Y) \approx \max_k D\delta_k^S(Y)$ and consequently $R(\theta, \delta_*^S) \approx \min_k R(\theta, \delta_k^S)$, when $Y$ or $\theta$ was close to $V_1 \cup \cdots \cup V_K$, respectively. The calibration suggested in (3.7) seems to roughly achieve this goal, as is borne out by the simulations in the next section.


3.3. Simulations of the multiple shrinkage Stein estimator. To gain some idea of the potential quality of the approximation of $R(\theta, \delta_*^S)$ to $\min_k R(\theta, \delta_k^S)$, we obtained Monte Carlo estimates for the case $p = 10$ of the risk of $\delta_*^S$ and the corresponding Stein estimators for simple examples of the equidimensional target case and the nested subspace target case. The risk of each estimator for each choice of $\theta$ was estimated by the average loss $\|\delta - \theta\|^2$ based on 10,000 independent samples of $Y \sim N_{10}(\theta, I)$. (The normal random deviates were generated from the IMSL routine GGNML.) In assessing the potential practical value of the estimates, recall that $R(\theta, \delta^{MLE}) = 10$ here.

In the equidimensional case, we simulated the risk of two Stein estimators $\delta_k^S$ with $V_k = \nu_k \in R^{10}$, $k = 1, 2$, and two choices of the corresponding multiple shrinkage estimator $\delta_*^S$ with $r = w_2/w_1 = 1$ and 9 ($K = 2$). Three choices of $\nu_1$ and $\nu_2$ were considered, corresponding to the separations $d^2 = \|\nu_1 - \nu_2\|^2 = 2.5, 10, 40$, obtained by changing each coordinate 0.5, 1, 2 standard deviations. For each separation, nine values of $\theta = (1 - \lambda)\nu_1 + \lambda\nu_2$, obtained by varying $\lambda$ from $-0.5$ to $1.5$ in steps of $0.25$, were considered. The risk estimates, which appear in Table 1, show that the risk reduction of $\delta_*^S$ is impressive. When $r = 1$, the performance of $\delta_*^S$ at the separation of $d^2 = 40$ is essentially indistinguishable from the best of $\delta_1^S$ and $\delta_2^S$. For the smaller separations $d^2 = 2.5, 10$, the performance close to the targets deteriorates only slightly, although it improves between the targets. For the nonuniformly weighted case with $r = 9$, the performance of $\delta_*^S$ improves slightly when $\theta$ is close to $\nu_2$, and deteriorates slightly when $\theta$ is close to $\nu_1$, apparently the result of the strongly adaptive relevance functions.

In the case of nested subspace targets, we considered eight Stein estimators $\delta_k^S$,
$k = 1, \ldots, 8$, for which $V_k = \{v \in R^{10}: v^i = 0 \text{ if } i \ge k\}$, where $v^i$ is the $i$th
coordinate of $v$, and six choices of $\delta_*^S$ ($K = 8$), using calibrations of the prior
weights given by (3.7) with $c = 1, 2, 3, 5, 10, 50$. The risk of these estimators was
compared for $\theta = 0$ and for eight choices of $\theta$ with $\|\theta\|^2 = \theta_i^2 = 40$, $i = 1, \ldots, 8$.
These values of $\theta$ were chosen because setting $\|\theta\|^2 = \theta_i^2 = 40$ effectively eliminates
the useful risk reduction of those $\delta_k^S$ for which $k \le i$. The risk estimates, which
appear in Table 2, show that from a practical point of view, the approximation
$R(\theta, \delta_*^S) \approx \min_k R(\theta, \delta_k^S)$ can be excellent when $\theta$ is close to any of the targets.
Indeed, when $c = 2$, $R(\theta, \delta_*^S) < \min_k R(\theta, \delta_k^S) + 1$ for each $\theta$ considered. As $c$ is
increased, the risk at the smaller dimensional targets improves,
though at the expense of some deterioration at the other targets. The calibration
of prior weights given by (3.7) seems to work quite well here, and yielded better
results than other calibrations that we tried. Finally, to end on a cautious note,
this second simulation explores a very small region of the parameter space. Before
$\delta_*^S$ can be used with confidence in a nested subspace situation like this, a much
more comprehensive simulation study would be needed.


TABLE 1
The risk of $\delta_*^S$ when $Y \sim N_{10}(\theta, I)$: the equidimensional case,
$\theta = (1 - \lambda)v_1 + \lambda v_2$

$\lambda$                           -0.50  -0.25   0.00   0.25   0.50   0.75   1.00   1.25   1.50

$d^2 = 40$
$R(\theta, \delta_k^S)$, $k = 1$      6.2    3.2    1.3    3.2    6.1    7.8    8.6    9.1    9.4
                         $k = 2$      9.4    9.1    8.7    7.8    6.2    3.2    1.3    3.2    6.1
$R(\theta, \delta_*^S)$, $r = 1$      6.2    3.2    1.3    3.6    6.1    3.6    1.3    3.2    6.1
                         $r = 9$      6.2    3.2    1.6    4.3    6.1    3.3    1.3    3.2    6.1

$d^2 = 10$
$R(\theta, \delta_k^S)$, $k = 1$      3.2    1.8    1.3    1.8    3.2    4.8    6.1    7.1    7.8
                         $k = 2$      7.9    7.2    6.2    4.8    3.2    1.8    1.3    1.8    3.2
$R(\theta, \delta_*^S)$, $r = 1$      3.3    2.1    1.8    2.2    2.5    2.2    1.7    2.1    3.3
                         $r = 9$      4.0    3.1    3.0    3.1    2.7    1.8    1.4    1.9    3.2

$d^2 = 2.5$
$R(\theta, \delta_k^S)$, $k = 1$      1.8    1.4    1.3    1.4    1.8    2.4    3.1    4.0    4.8
                         $k = 2$      4.7    3.9    3.1    2.4    1.8    1.4    1.3    1.4    1.8
$R(\theta, \delta_*^S)$, $r = 1$      2.3    1.9    1.6    1.5    1.4    1.5    1.6    1.9    2.3
                         $r = 9$      3.6    3.0    2.5    2.0    1.6    1.4    1.3    1.5    1.9

Note: 10,000 replications. The standard error of each estimate is less than 0.04.


TABLE 2
The risk of $\delta_*^S$ when $Y \sim N_{10}(\theta, I)$: the nested case
(first column: $\theta = 0$; remaining columns: $\|\theta\|^2 = \theta_i^2 = 40$, $i = 1, \ldots, 8$)

                                        0  i = 1  i = 2  i = 3  i = 4  i = 5  i = 6  i = 7  i = 8
$R(\theta, \delta_k^S)$, $k = 1$      1.3    8.6    8.6    8.6    8.6    8.6    8.6    8.6    8.6
                         $k = 2$      2.3    2.3    8.9    9.0    8.9    8.9    8.9    8.9    8.9
                         $k = 3$      3.3    3.3    3.3    9.2    9.2    9.2    9.2    9.2    9.2
                         $k = 4$      4.3    4.3    4.3    4.3    9.4    9.4    9.4    9.4    9.4
                         $k = 5$      5.4    5.4    5.4    5.4    5.4    9.6    9.6    9.6    9.6
                         $k = 6$      6.4    6.4    6.4    6.4    6.4    6.4    9.8    9.8    9.8
                         $k = 7$      7.5    7.5    7.5    7.5    7.5    7.5    7.5    9.9    9.9
                         $k = 8$      8.6    8.6    8.6    8.6    8.6    8.6    8.6    8.6   10.0
$R(\theta, \delta_*^S)$, $c = 1$      4.8    5.4    5.9    6.4    7.0    7.5    8.1    8.8    9.9
                         $c = 2$      2.3    3.3    4.3    5.3    6.3    7.3    8.2    9.0    9.6
                         $c = 3$      1.8    2.9    4.0    5.1    6.4    7.6    8.5    9.0    9.1
                         $c = 5$      1.5    2.6    3.9    5.5    7.1    8.3    8.7    8.8    8.8
                         $c = 10$     1.4    2.6    4.4    6.8    8.3    8.7    8.7    8.7    8.7
                         $c = 50$     1.3    3.1    7.0    8.6    8.6    8.7    8.6    8.6    8.7

Note: 10,000 replications. The standard error of each estimate is less than 0.05.


3.4. An approximation for a family of mixture priors. Although $\delta_*^S$ is not a
Bayes rule, it may be useful to regard it as an approximation to Bayes rules. Such
an approximation is suggested by the empirical Bayes relationship of $\delta_k^S$ to the
Bayes rule

(3.11)  $E_{\pi_k}(\theta \mid Y) = Y - \frac{1}{1 + a_k}(Y - \mu)$,  where $\pi_k(\theta) = (2\pi a_k)^{-p/2} \exp\{-\|\theta - \mu\|^2 / 2a_k\}$,

when it is assumed only that $\pi_k$ belongs to the family of conjugate priors

(3.12)  $\Gamma_k = \{\pi_k(\theta): \mu \in V_k \text{ and } a_k \ge 0\}$.

$\delta_k^S$ is typically motivated as an empirical Bayes approximation to $E_{\pi_k}(\theta \mid Y)$ by
inserting the estimates

(3.13)  $\hat\mu_k = P_k Y$  and  $\hat a_k = \max\{0, (s_k(Y)/(q_k - 2)) - 1\}$

into the left-hand expression in (3.11) [see, e.g., Stein (1962), Efron and Morris
(1973a), Zellner and Vandaele (1974), and Morris (1983)]. Because

(3.14)  $E_{\pi_k}(\theta \mid Y) = Y + \nabla \log m(Y \mid \pi_k) = Y - \frac{1}{1 + a_k}(Y - \mu)$,

where

(3.15)  $m(Y \mid \pi_k) = (2\pi(1 + a_k))^{-p/2} \exp\{-\|Y - \mu\|^2 / 2(1 + a_k)\}$,

$m_k^S(Y)$ may then be regarded as an estimate of the marginal density $m(Y \mid \pi_k)$ (up
to a proportionality constant), implicitly determined by $\hat\mu_k$ and $\hat a_k$ (or
equivalently $\delta_k^S(Y)$), through (3.14). Note that $m_k^S(Y)$ is not obtained by inserting the
estimates $\hat\mu_k$ and $\hat a_k$ directly into $m(Y \mid \pi_k)$ in (3.15).

By treating $m_k^S(Y)$ and $\delta_k^S(Y)$ as estimates of $m(Y \mid \pi_k)$ and $E_{\pi_k}(\theta \mid Y)$, $\delta_*^S$ may
then be regarded as an approximation to the Bayes rules for the family of
mixtures of conjugate priors,

(3.16)  $\Gamma_* = \{\pi_*: \pi_*(\theta) = \sum_{k=1}^K w_k' \pi_k(\theta), \text{ where } \pi_k \in \Gamma_k\}$,

since each of these Bayes rules may be expressed as

(3.17)  $E_{\pi_*}(\theta \mid Y) = Y + \nabla \log m(Y \mid \pi_*)$,  where $m(Y \mid \pi_*) = \sum_{k=1}^K w_k' m(Y \mid \pi_k)$,

or

(3.18)  $E_{\pi_*}(\theta \mid Y) = \sum_{k=1}^K P(\pi_k \mid Y) E_{\pi_k}(\theta \mid Y)$,  where $P(\pi_k \mid Y) = w_k' m(Y \mid \pi_k) / m(Y \mid \pi_*)$.

Because of the absence of meaningful norming constants for $m_1^S, \ldots, m_K^S$, the
prior probabilities $w_1', \ldots, w_K'$ in (3.16)-(3.18) may differ from $w_1, \ldots, w_K$.
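
The equivalence of the two forms (3.17) and (3.18) can be checked numerically. The following sketch is illustrative only: it assumes each $\pi_k$ has a fixed, known mean $\mu_k$ and variance $a_k$ (hypothetical values chosen here), computes the mixture Bayes rule once as the posterior-weighted average of the component rules (3.11), and once as $Y + \nabla \log m(Y \mid \pi_*)$ with the gradient approximated by central differences.

```python
import math

def normal_marginal(y, mu, var):
    """Density of N_p(mu, var * I) at y; var = 1 + a_k as in (3.15)."""
    p = len(y)
    sq = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return (2.0 * math.pi * var) ** (-p / 2.0) * math.exp(-sq / (2.0 * var))

def component_rule(y, mu, a):
    """E_{pi_k}(theta | Y) = Y - (Y - mu) / (1 + a), as in (3.11)."""
    return [yi - (yi - mi) / (1.0 + a) for yi, mi in zip(y, mu)]

def mixture_rule_posterior(y, weights, mus, a_s):
    """Form (3.18): average of component rules weighted by P(pi_k | Y)."""
    m = [w * normal_marginal(y, mu, 1.0 + a) for w, mu, a in zip(weights, mus, a_s)]
    total = sum(m)
    rules = [component_rule(y, mu, a) for mu, a in zip(mus, a_s)]
    return [sum(mk / total * r[i] for mk, r in zip(m, rules)) for i in range(len(y))]

def mixture_rule_gradient(y, weights, mus, a_s, h=1e-5):
    """Form (3.17): Y + grad log m(Y | pi_*), gradient by central differences."""
    def log_m(z):
        return math.log(sum(w * normal_marginal(z, mu, 1.0 + a)
                            for w, mu, a in zip(weights, mus, a_s)))
    out = []
    for i in range(len(y)):
        up = list(y); up[i] += h
        dn = list(y); dn[i] -= h
        out.append(y[i] + (log_m(up) - log_m(dn)) / (2.0 * h))
    return out

# Hypothetical inputs, chosen only for illustration.
y = [0.3, -1.2, 2.0]
weights = [0.5, 0.3, 0.2]
mus = [[0.0, 0.0, 0.0], [1.0, -1.0, 2.0], [3.0, 3.0, 3.0]]
a_s = [1.0, 0.5, 4.0]
via_posterior = mixture_rule_posterior(y, weights, mus, a_s)
via_gradient = mixture_rule_gradient(y, weights, mus, a_s)
print(via_posterior, via_gradient)
```

The two vectors agree up to finite-difference error. Note that this is the genuine Bayes rule for a fixed mixture prior, not $\delta_*^S$, which replaces the component rules and marginals by $\delta_k^S$ and $m_k^S$.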

The family $\Gamma_*$ generalizes the family $\Gamma_k$ in (3.12), allowing for much more
flexibility in the specification of the location of the prior mean. Note that
although the empirical Bayes approach of inserting parameter estimates has been
used successfully with families of contaminated mixture priors by Berger and
Berliner (1983, 1984), insertion of the estimators $\hat\mu_k$ and $\hat a_k$ directly into
$E_{\pi_*}(\theta \mid Y)$ in (3.17) or (3.18) would not yield $\delta_*^S$. Indeed, the resulting
estimators appear not to be minimax in general [see George (1986c)].

It is interesting to contrast $\delta_*^S$ with the Bayes estimator $E_{\pi_*}(\theta \mid Y)$. Both the
relevance function $\rho_k^S$ and the posterior probability $P(\pi_k \mid Y)$ are "adaptive" and put
increasing weight on the estimator which is supported by the data. However,
each $\delta_k^S$ shrinks less when $Y$ is further from $V_k$, in sharp contrast to $E_{\pi_k}(\theta \mid Y)$,
which shrinks more. Only $\delta_*^S$ possesses the robust property of behaving like the MLE
when $Y$ is far from all the targets.


4. The general case. As a generalization of the situation in Section 2,
suppose vague or conflicting prior information suggested that small risk might be
obtainable by some member of a possibly infinite set of estimators,

(4.1)  $\Delta_\Omega = \{\delta_\omega: \delta_\omega(Y) = Y + \nabla \log m_\omega(Y), \ \omega \in \Omega\}$,

where for each $\omega$ in the indexing set $\Omega$, $m_\omega: R^p \to R^+ \setminus \{0\}$ is such that $m_\omega$ and
$\nabla m_\omega$ are a.d. Let $W$ be a probability measure on $\Omega$ such that for a.e. $y \in R^p$,
$m_\omega(y)$ is a measurable function of $\omega$ wrt $W$, and

(4.2)  $m_*(Y) = \int m_\omega(Y)\, W(d\omega)$

exists and is such that $\nabla$ and $\int$ may be interchanged to yield

(4.3)  $\nabla m_*(Y) = \int \nabla m_\omega(Y)\, W(d\omega)$  and  $\nabla^2 m_*(Y) = \int \nabla^2 m_\omega(Y)\, W(d\omega)$.

Note that any discrete finite probability measure $W$ will always satisfy these
conditions. With this setup, a multiple shrinkage estimator may be defined as

(4.4)  $\delta_* = Y + \nabla \log m_*(Y)$,

and may be reexpressed as

(4.5)  $\delta_*(Y) = Y + \int \nabla \log m_\omega(Y)\, \rho(Y, d\omega) = \int \delta_\omega(Y)\, \rho(Y, d\omega)$,

where

(4.6)  $\rho(Y, d\omega) = m_\omega(Y)\, W(d\omega) / m_*(Y)$.

The probability measure $W$ generalizes the prior weights $w_1, \ldots, w_K$ in (2.3), and
the adaptive probability measure $\rho(Y, d\omega)$ generalizes the relevance functions
$\rho_1(Y), \ldots, \rho_K(Y)$ in (2.5). Indeed, when $W$ is a discrete finite probability measure,
$\delta_*$ in (4.4) reduces to $\delta_*$ in (2.2).

As in the discrete case, it is of interest to apply Stein's Theorem 1 and
Corollary 1 to this general version of $\delta_*$. The following analogues of Lemma 1
and Corollaries 2 and 3, which are proved similarly, depend on both $\Delta_\Omega$ and $W$.


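LEMMA 2. If $\Delta_\Omega$ and $W$ are such that

(i)  $E_\theta \int |\nabla_i^2 m_\omega(Y) / m_\omega(Y)|\, \rho(Y, d\omega) < \infty$, $i = 1, \ldots, p$,

(ii)  $E_\theta \int \|\nabla \log m_\omega(Y)\|^2\, \rho(Y, d\omega) < \infty$,

then $\delta_*$ satisfies the conditions of Theorem 1.


COROLLARY 4. If $\Delta_\Omega$ and $W$ are such that the conditions of Lemma 2 are
satisfied and each $m_\omega$, $\omega \in \Omega$, is superharmonic, then $\delta_*$ is minimax.


COROLLARY 5. If $\Delta_\Omega$ and $W$ are such that the conditions of Lemma 2 are
satisfied, then

(4.7)  $D\delta_*(Y) = \int \Big[ D\delta_\omega(Y) - \tfrac{1}{2} \int \|\delta_\omega(Y) - \delta_\eta(Y)\|^2\, \rho(Y, d\eta) \Big] \rho(Y, d\omega)$.


Also, note when each $\delta_\omega \in \Delta_\Omega$ is a Bayes rule with respect to a prior $\pi_\omega(\theta)$,
then $\delta_* = E_{\pi_*}(\theta \mid Y)$ will be the Bayes rule corresponding to the mixture prior

(4.8)  $\pi_*(\theta) = \int \pi_\omega(\theta)\, W(d\omega)$,

generalizing the motivation in Section 2.3.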
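
Returning to the decomposition in Corollary 5, a small numerical check for a discrete $W$ may be useful. The sketch below is not from the paper and rests on two labelled assumptions: the operator $D$ of (2.6), which is not reproduced in this section, is taken to be $D\delta(Y) = \|\nabla \log m(Y)\|^2 - 2\nabla^2 m(Y)/m(Y)$ (the form under which the risk equals $p - E_\theta D\delta(Y)$), and each $m_k$ is a Gaussian marginal $N_p(\mu_k, v_k I)$ with hypothetical $\mu_k$, $v_k$ and weights.

```python
import math

def gaussian_pieces(y, mu, var):
    """For m = N_p(mu, var * I): return m(y), grad log m(y), (Laplacian m)(y) / m(y)."""
    p = len(y)
    diff = [yi - mi for yi, mi in zip(y, mu)]
    sq = sum(d * d for d in diff)
    dens = (2.0 * math.pi * var) ** (-p / 2.0) * math.exp(-sq / (2.0 * var))
    grad_log = [-d / var for d in diff]
    lap_ratio = sq / var ** 2 - p / var
    return dens, grad_log, lap_ratio

def sq_norm(u):
    return sum(ui * ui for ui in u)

# Hypothetical inputs, chosen only for illustration.
y = [0.5, -0.7, 1.9, 0.1]
weights = [0.2, 0.5, 0.3]
mus = [[0.0] * 4, [1.0, -1.0, 2.0, 0.0], [3.0] * 4]
variances = [2.0, 1.5, 5.0]
p = len(y)

pieces = [gaussian_pieces(y, mu, v) for mu, v in zip(mus, variances)]
mix = sum(w * pc[0] for w, pc in zip(weights, pieces))
rho = [w * pc[0] / mix for w, pc in zip(weights, pieces)]   # relevance weights

# Left side: D applied to the mixture, using grad m_* and Laplacian m_* as in (4.3).
grad_star = [sum(r * pc[1][i] for r, pc in zip(rho, pieces)) for i in range(p)]
lap_star = sum(r * pc[2] for r, pc in zip(rho, pieces))
lhs = sq_norm(grad_star) - 2.0 * lap_star

# Right side: weighted component values minus half the weighted pairwise spread.
d_k = [sq_norm(pc[1]) - 2.0 * pc[2] for pc in pieces]
spread = sum(rho[k] * rho[j] *
             sq_norm([a - b for a, b in zip(pieces[k][1], pieces[j][1])])
             for k in range(len(pieces)) for j in range(len(pieces)))
rhs = sum(r * d for r, d in zip(rho, d_k)) - 0.5 * spread
print(lhs, rhs)
```

Here $\delta_k - \delta_j = \nabla \log m_k - \nabla \log m_j$, so the pairwise term is computed from the gradients directly. The two printed values coincide up to rounding under the stated definition of $D$.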
EXAMPLE 1. Shrinkage towards an arbitrary set. Suppose interest was initially
focused on using a Stein estimator of the form $\delta_v^S$ in (1.3), but vague prior
information suggested only that $\theta$ was close to some set $A \subset R^p$. Instead of
choosing an estimator from the set

(4.9)  $\Delta_A = \{\delta_v^S: \delta_v^S(Y) = Y + \nabla \log m_v^S(Y), \ v \in A\}$,

where $m_v^S(Y)$ is the special case of $m_k^S(Y)$ in (3.3) when $V_k = v$, a more desirable
estimator may be a multiple shrinkage Stein estimator of the form

(4.10)  $\delta_*^S(Y) = Y + \nabla \log m_*^S(Y)$,  $m_*^S(Y) = \int_A m_v^S(Y)\, W(dv)$,

where $W$ is some probability measure on $A$ such that $m_*^S(Y)$ exists and (4.3)
holds. For example, if available prior information suggested only that $\|\theta\| = r > 0$,
then appropriate choices for $A$ and $W$ would be $B_r = \{v \in R^p: \|v\| = r\}$ and the
uniform measure on $B_r$. Alternative estimators which shrink $Y$ towards $B_r$ have
been considered by Bock (1983) and George (1986c).

Although the conditions of Lemma 2 must in general be verified for each
choice of $\Delta_A$ and $W$, it can be shown that these will hold whenever $A$ is bounded.
Thus, by Corollary 4, any choice of $\delta_*^S$ in (4.10) with $A = B_r$ will be minimax.


EXAMPLE 2. Shrinkage towards a subspace measured with error. Consider
the situation where $\theta$ was thought to lie close to $[X]$, the subspace spanned by
the columns of a $p \times n$ matrix $X$ ($n < p - 3$), and interest was initially focused
on using a Stein estimator of the form $\delta_k^S$ in (3.1) with $V_k = [X]$. However,
suppose that these columns were covariates observed with error; that only
$X_\varepsilon = X + \varepsilon$ was available, with $\varepsilon$ an unobservable $p \times n$ matrix of errors with
distribution $\Psi$. Instead of choosing a Stein estimator from the set

(4.11)  $\Delta_\Omega = \{\delta_\varepsilon^S: \delta_\varepsilon^S(Y) = Y + \nabla \log m_\varepsilon^S(Y), \ \varepsilon \in \Omega\}$,

where $\Omega = R^{p \times n}$, and $m_\varepsilon^S(Y)$ is the special case of $m_k^S(Y)$ in (3.3) with $V_k = [X_\varepsilon]$,
it may be more desirable to use a multiple shrinkage Stein estimator of the form

(4.12)  $\delta_*^S(Y) = Y + \nabla \log m_*^S(Y)$,  $m_*^S(Y) = \int m_\varepsilon^S(Y)\, \Psi(d\varepsilon)$,

when $X$ and $\Psi$ are such that $m_*^S(Y)$ exists and (4.3) holds. As in Example 1
above, to apply Corollaries 4 and 5, the conditions of Lemma 2 must in general be
verified for each choice of $X$ and $\Psi$.


5. The case of unknown variance. The multiple shrinkage estimator $\delta_*$ in
(2.2) or (4.4) is easily extended to handle the more realistic situation where

(5.1)  $Y \mid \theta, \sigma \sim N_p(\theta, \sigma^2 I)$,

and an independent estimate of $\sigma^2$ is available, namely

(5.2)  $S \sim \sigma^2 \chi_d^2$,

where $\chi_d^2$ is the chi-square distribution with $d$ degrees of freedom. Simply replace
$\delta_*(Y) = Y + \nabla \log m_*(Y)$ by the multiple shrinkage estimator

(5.3)  $\delta_*^{\sigma}(Y) = Y + \frac{S}{d + 2} \nabla \log m_*(Y)$.

When $\delta_*$ satisfies the conditions of Theorem 1, it is easy to see from the main
results of Stein (1981) (Section 8) that $\delta_*^{\sigma}$ has risk

(5.4)  $R(\theta, \sigma, \delta_*^{\sigma}) = E_{\theta,\sigma} \|\theta - \delta_*^{\sigma}\|^2 = \sigma^2 \Big[ p - E_{\theta,\sigma} \frac{d}{d + 2} D\delta_*(Y/\sigma) \Big]$,

where $D\delta_*(Y/\sigma)$ is given by (2.6). The generalization of the other results is
straightforward. As Stein points out, when $\sigma^2$ is unknown the reduction in risk
is diminished only by a factor of $d/(d + 2)$.
A new class of solutions to a general differential inequality often encountered
in multiparameter estimation problems is obtained. Using these solutions
as guidelines, improved estimators for the scale parameters as well as
the natural parameters of independent gamma distributions are obtained for
a large class of weighted quadratic losses. The improved estimators have an
empirical Bayes interpretation. They also permit an exact analytical representation
of the risk improvement. For the ordinary squared-error loss, a larger
class of improved estimates is obtained which may allow for incorporation of
prior information in choosing an alternative estimate. Numerical results are
given which indicate the extent of risk improvement in certain situations.


1. Introduction. Since the pioneering work of Stein (1956), a great amount 
of research has been done on exhibiting the presence of the Stein effect in various 
probability structures with an infinite number of points in the sample space. 
Scattered inadmissibility results were gradually unified after the powerful tech- 
nique of improving upon an inadmissible estimator via integration by parts was 
found by Stein (1973). The technique of explicitly constructing improved estima- 
tors by solving differential (difference) inequalities on the sample space has since 
been very productively used by many authors, notably, Hudson (1978), Berger 
(1980a), Hwang (1982), and Ghosh, Hwang, and Tsui (1983). The beauty of the
method lies in two facts: it often allows for consideration of a large number of
losses of general quadratic form, since the solutions to the differential inequalities
often follow a general pattern, and it offers one a choice
from a large class of improved estimators. (For an indication of how this scope for
choice leads to highly interesting selection problems, see Berger (1982).) However,
one should perhaps mention in the same breath that the improved estimators
thus obtained may be extremely loss-specific and also moderately to highly
unwieldy; in fact some of the recent skepticism about inadmissibility results has
a lot to do with these undesirable features.

This paper deals primarily with simultaneous estimation of parameters in
independent gamma distributions, although some of the results in the next
section extend to the problem of estimating the vector of natural parameters in
the general continuous exponential family. In Berger (1980a), the problem of
estimating the vector of scale parameters $\theta_1^{-1}, \theta_2^{-1}, \ldots, \theta_p^{-1}$ of $p$ independent
gamma distributions was considered for four different losses $\sum_{i=1}^p \theta_i^{m}(\delta_i\theta_i - 1)^2$,
where $m = -2, -1, 0$, or 1, and explicit improved estimators were obtained.
Unfortunately, the improved estimators looked completely different for the four
different losses and their functional form did not allow for study of actual risk
improvement except through possible simulation. (This of course is the typical
picture in multiparameter estimation problems.) In the next section we shall
treat the problem of estimating gamma scale parameters under general weighted
quadratic losses $\sum_{i=1}^p c_i\theta_i^{m_i}(\delta_i\theta_i - 1)^2$, where $c_i$ and $m_i$ ($\ne 0$) are any constants.
Losses of this kind will obviously include three of the four losses studied in
Berger (1980a) as special cases. (The excluded case $m = 0$ refers to the invariant
quadratic loss. Why the invariant loss has to be left out will be discussed later.)
Note also that similar losses were considered by Hwang (1982) in the
multiparameter Poisson problem and analogs of these losses in the normal case were
considered in Brown (1980), although with a somewhat different purpose.
Improved estimators are obtained for all these losses for $p \ge 2$ and it will be seen
that the improved estimates are functionally similar and look alike for the
different losses. Improved estimates are also obtained for a variety of losses for
the natural parameters and again they are functionally similar for different
losses.

Berger (1980a) showed how certain terms in his differential inequalities played
the dominant role in obtaining improved estimators. It will be seen that solutions
to the dominant inequality suggest improved estimators in all the problems we
consider. These suggested estimators can then be shown to be actually dominating
by exactly calculating the risk difference. Towards this end, in the next
section we obtain a new class of solutions to the general differential inequality

(1.1)  $\Delta(x) = v(x) \sum_{i=1}^p v_i(x_i)\, \phi_i^{(1)}(x) + \sum_{i=1}^p w_i(x)\, \phi_i^2(x) < 0$,

with $\phi_i^{(1)}(x) = \partial \phi_i(x) / \partial x_i$, first studied in Berger (1980a). Solutions were
first found by him and then his class of solutions was extended by Ghosh and
Parsian (1980) in the spirit of Efron and Morris (1976). Our solutions are new and
these are then used to form possibly improved estimators in the gamma problems.
We then calculate the risk difference analytically and show that the estimators
heuristically obtained are indeed improvements in terms of risk for all the losses
mentioned above. The scale parameters as well as the natural parameters are
considered. Next, for the ordinary squared-error loss ($m = -2$), the inadmissibility
results have been extended to give a broader class of improved estimators. For the
ordinary squared-error loss, we also show that our class of improved estimators has
an empirical Bayesian justification. The question of the actual amount of risk
improvement is of great interest in practice. Since our improved estimators
permit exact analytical representation of the risk improvement, we have studied
this aspect analytically to some extent and then actually calculated the
percentage risk improvements in some situations. For the ordinary squared-error
loss, there is considerable improvement in terms of risk.


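2. Construction of improved estimators. In this section we first obtain a
new class of solutions to the differential inequality (1.1). The importance of this
general differential inequality in multiparameter estimation is now very well
known. See Berger (1980a) and Ghosh and Parsian (1980) for an extensive
discussion.

THEOREM 1. Consider the differential inequality (1.1), and assume $v(x) > 0$.
Define $g_i(x_i)$ by $g_i'(x_i) = 1/v_i(x_i)$. Suppose for some $\alpha \ne 1/(p + 1)$, for some
$k > 0$,

(2.1)  $\frac{1}{v(x)} \sum_{i=1}^p w_i(x)\, |g_i(x_i)|^{2\alpha} \le k$  for all $x$.

Then $\phi = (\phi_1, \phi_2, \ldots, \phi_p)$ with

(2.2)  $\phi_i(x) = c\, \mathrm{sgn}(g_i(x_i))\, |g_i(x_i)|^{\alpha} \prod_{j=1}^p |g_j(x_j)|^{(\alpha - 1)/p}$

solves $\Delta(x) < 0$, whenever $c\{\alpha(p + 1) - 1\} + kc^2 < 0$.

PROOF. Clearly, for almost all $x$,

(2.3)  $\phi_i^{(1)}(x) = \frac{\partial}{\partial x_i} \phi_i(x)
        = c\alpha\, |g_i(x_i)|^{\alpha - 1} g_i'(x_i) \prod_{j=1}^p |g_j(x_j)|^{(\alpha - 1)/p}
        + c\, \frac{\alpha - 1}{p}\, |g_i(x_i)|^{\alpha - 1} g_i'(x_i) \prod_{j=1}^p |g_j(x_j)|^{(\alpha - 1)/p}$.

Hence,

$\Delta(x) = v(x)\, \frac{c[\alpha(p + 1) - 1]}{p} \prod_{j=1}^p |g_j(x_j)|^{(\alpha - 1)/p} \sum_{i=1}^p |g_i(x_i)|^{\alpha - 1}
        + c^2 \prod_{j=1}^p |g_j(x_j)|^{2(\alpha - 1)/p} \sum_{i=1}^p w_i(x)\, |g_i(x_i)|^{2\alpha}$

$\le v(x)\, \frac{c[\alpha(p + 1) - 1]}{p} \prod_{j=1}^p |g_j(x_j)|^{(\alpha - 1)/p} \sum_{i=1}^p |g_i(x_i)|^{\alpha - 1}
        + c^2 k\, v(x) \prod_{j=1}^p |g_j(x_j)|^{2(\alpha - 1)/p}$

$\le v(x) \prod_{j=1}^p |g_j(x_j)|^{(\alpha - 1)/p} \sum_{i=1}^p |g_i(x_i)|^{\alpha - 1}
        \Big[ \frac{c[\alpha(p + 1) - 1]}{p} + \frac{c^2 k}{p} \Big]$

$< 0$,  if $c\{\alpha(p + 1) - 1\} + kc^2 < 0$. □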
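
A quick numerical illustration of Theorem 1 may help fix ideas. The sketch below is not from the paper: it assumes the simplest setting $v(x) \equiv 1$ and $v_i(x_i) \equiv 1$ (so that $g_i(x_i) = x_i$), picks the hypothetical weights $w_i(x) = 1/(p(1 + |x_i|))$, for which (2.1) holds with $\alpha = 1/2$ and $k = 1$, and evaluates $\Delta(x)$ in (1.1) with $\phi_i^{(1)}$ approximated by central differences. With $p = 3$ the condition $c\{\alpha(p + 1) - 1\} + kc^2 < 0$ reduces to $-1 < c < 0$.

```python
import math
import random

P = 3
ALPHA = 0.5
C = -0.5   # satisfies C * (ALPHA * (P + 1) - 1) + 1.0 * C**2 < 0 with k = 1

def phi(i, x):
    """phi_i of (2.2) with g_j(x_j) = x_j (assumption: v_j = 1)."""
    prod = math.prod(abs(xj) ** ((ALPHA - 1.0) / P) for xj in x)
    return C * math.copysign(1.0, x[i]) * abs(x[i]) ** ALPHA * prod

def weight(i, x):
    """Hypothetical w_i(x); sum_i w_i(x) |x_i|^(2 * ALPHA) <= 1, so k = 1."""
    return 1.0 / (P * (1.0 + abs(x[i])))

def big_delta(x, h=1e-6):
    """Delta(x) of (1.1) with v = v_i = 1; d phi_i / d x_i by central differences."""
    total = 0.0
    for i in range(P):
        up = list(x); up[i] += h
        dn = list(x); dn[i] -= h
        dphi = (phi(i, up) - phi(i, dn)) / (2.0 * h)
        total += dphi + weight(i, x) * phi(i, x) ** 2
    return total

# Random test points kept away from the coordinate hyperplanes, where g_i = 0.
rng = random.Random(1)
points = [[rng.choice((-1.0, 1.0)) * rng.uniform(0.2, 5.0) for _ in range(P)]
          for _ in range(1000)]
worst = max(big_delta(x) for x in points)
print(worst)
```

The largest value of $\Delta(x)$ over the sampled points stays strictly negative, as the theorem asserts for this choice of $c$.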
REMARK 1. Under appropriate conditions similar to condition (2.1), a class of
solutions to $\Delta(x) < 0$ was found in Berger (1980a). The solutions there were
essentially of the form

(2.5)  $\phi_i(x) = \frac{-c\, g_i(x_i)}{b + \sum_{j=1}^p |g_j(x_j)|^{\beta}}$,

where $b$, $c$ and $\beta$ are suitable positive numbers. The constant $c$ can be
generalized to an appropriate function $r(x)$ (see Ghosh and Parsian (1980)).


REMARK 2. The new class of solutions in (2.2) can be used to anticipate
improved estimates of the scale as well as the natural parameters of independent
gamma variables under a wide class of weighted quadratic losses. At this stage,
we merely mention that if the loss is $\sum_{i=1}^p \eta_i^{m_i}(\delta_i\eta_i - 1)^2$ (where $\eta_i = \theta_i$ or $1/\theta_i$),
one may heuristically arrive at $[x_i/(\alpha_i + 1)][1 + c\, x_i^{m_i/2} \prod_{j=1}^p x_j^{-m_j/2p}]$
and $[(\alpha_i - 2)/x_i][1 + c\, x_i^{-m_i/2} \prod_{j=1}^p x_j^{m_j/2p}]$ as possibly improved estimators of
the scale and natural parameters, respectively, by considering certain dominant
parts of the relevant differential inequalities (see Berger (1980a)).


We now actually prove that the estimators heuristically obtained above
dominate the standard estimators in terms of risk.

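THEOREM 2. Let $X_i \sim$ indep. Gamma$(\alpha_i, \theta_i)$, $i = 1, \ldots, p$, where the $\alpha_i$'s are
considered known. Consider the problem of estimating $(\theta_1^{-1}, \theta_2^{-1}, \ldots, \theta_p^{-1})$ under a
loss $L(\theta, \delta) = \sum_{i=1}^p c_i\theta_i^{m_i}(1 - \delta_i\theta_i)^2$, where $c_i > 0$, $m_i \ne 0$, and $i = 1, 2, \ldots, p$.
Consider the estimates $\delta_0(x)$ and $\delta(x)$ defined as

(2.6)  $\delta_{0,i}(x) = \frac{x_i}{\alpha_i + 1}$,  $\delta_i(x) = \frac{x_i}{\alpha_i + 1}\,[1 + \phi_i(x)]$,

where

$\phi_i(x) = -c\,(\mathrm{sgn}\, m_i)\, x_i^{m_i/2} \prod_{j=1}^p x_j^{-m_j/2p}$,

$c > 0$ sufficiently small (see (2.10)).
Then $R(\theta, \delta) < R(\theta, \delta_0)$ for every $\theta$ if $p \ge 2$ and hence $\delta_0(X)$ is inadmissible.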


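PROOF. Letting $\Delta(\theta) = R(\theta, \delta) - R(\theta, \delta_0)$,

(2.7)  $\Delta(\theta) = E\Big[ \sum_{i=1}^p c_i\theta_i^{m_i} \Big(1 - \frac{\theta_i x_i (1 + \phi_i(x))}{\alpha_i + 1}\Big)^2
        - \sum_{i=1}^p c_i\theta_i^{m_i} \Big(1 - \frac{\theta_i x_i}{\alpha_i + 1}\Big)^2 \Big]$

       $= E\Big[ \sum_{i=1}^p \frac{c_i\theta_i^{m_i + 2}}{(\alpha_i + 1)^2}\, x_i^2 \phi_i^2(x)
        + 2 \sum_{i=1}^p \frac{c_i\theta_i^{m_i + 2}}{(\alpha_i + 1)^2}\, x_i^2 \phi_i(x)
        - 2 \sum_{i=1}^p \frac{c_i\theta_i^{m_i + 1}}{\alpha_i + 1}\, x_i \phi_i(x) \Big]$.

Assuming that $\alpha_i > m_i/p$, $\alpha_i + 2 > m_i(1 - p)/p$, and $\alpha_i + 1 > m_i(1 - p)/2p$,
by direct calculations

(2.8)  $E\, x_i^2 \phi_i^2(x) = c^2\, \frac{\Gamma(\alpha_i + m_i + 2 - m_i/p)}{\Gamma(\alpha_i - m_i/p)}
        \prod_{j=1}^p \frac{\Gamma(\alpha_j - m_j/p)}{\Gamma(\alpha_j)}\; \theta_i^{-(m_i + 2)} \prod_{j=1}^p \theta_j^{m_j/p}$,

       $E\, x_i^2 \phi_i(x) = -c\,(\mathrm{sgn}\, m_i)\, \frac{\Gamma(\alpha_i + 2 + m_i/2 - m_i/2p)}{\Gamma(\alpha_i - m_i/2p)}
        \prod_{j=1}^p \frac{\Gamma(\alpha_j - m_j/2p)}{\Gamma(\alpha_j)}\; \theta_i^{-(m_i/2 + 2)} \prod_{j=1}^p \theta_j^{m_j/2p}$,

       $E\, x_i \phi_i(x) = -c\,(\mathrm{sgn}\, m_i)\, \frac{\Gamma(\alpha_i + 1 + m_i/2 - m_i/2p)}{\Gamma(\alpha_i - m_i/2p)}
        \prod_{j=1}^p \frac{\Gamma(\alpha_j - m_j/2p)}{\Gamma(\alpha_j)}\; \theta_i^{-(m_i/2 + 1)} \prod_{j=1}^p \theta_j^{m_j/2p}$.

Setting

$\frac{\Gamma(\alpha_i + m_i + 2 - m_i/p)}{\Gamma(\alpha_i - m_i/p)} = a_i$,  $\frac{\Gamma(\alpha_i - m_i/p)}{\Gamma(\alpha_i)} = b_i$,
$\frac{\Gamma(\alpha_i + 1 + m_i/2 - m_i/2p)}{\Gamma(\alpha_i - m_i/2p)} = d_i$,  and  $\frac{\Gamma(\alpha_i - m_i/2p)}{\Gamma(\alpha_i)} = e_i$,

(2.7) and (2.8) give

(2.9)  $\Delta(\theta) = c^2 k \prod_{j=1}^p (b_j \theta_j^{m_j/p})
        - 2c \sum_{i=1}^p \frac{c_i\,(\mathrm{sgn}\, m_i)\,(\alpha_i + 1 + m_i/2 - m_i/2p)\, d_i}{(\alpha_i + 1)^2}\, \theta_i^{m_i/2} \prod_{j=1}^p (e_j \theta_j^{m_j/2p})
        + 2c \sum_{i=1}^p \frac{c_i\,(\mathrm{sgn}\, m_i)\, d_i}{\alpha_i + 1}\, \theta_i^{m_i/2} \prod_{j=1}^p (e_j \theta_j^{m_j/2p})$

       $= c^2 k \prod_{j=1}^p (b_j \theta_j^{m_j/p})
        - 2c \sum_{i=1}^p \frac{c_i |m_i| (p - 1)\, d_i}{(\alpha_i + 1)^2\, 2p}\, \theta_i^{m_i/2} \prod_{j=1}^p (e_j \theta_j^{m_j/2p})$,

where $k = \sum_{i=1}^p c_i a_i/(\alpha_i + 1)^2$. Let $b = \max b_i$, $e = \min e_i$, and
$d = \min c_i |m_i| d_i/(\alpha_i + 1)^2$. Then (2.9) yields, since $\sum_{i=1}^p \theta_i^{m_i/2} \ge p \prod_{i=1}^p \theta_i^{m_i/2p}$,

(2.10)  $\Delta(\theta) \le c^2 k b^p \prod_{i=1}^p \theta_i^{m_i/p} - \frac{c\, d\, e^p (p - 1)}{p} \sum_{i=1}^p \theta_i^{m_i/2} \prod_{j=1}^p \theta_j^{m_j/2p}$

        $\le c \prod_{i=1}^p \theta_i^{m_i/p}\, \{c k b^p - d e^p (p - 1)\}$

        $< 0$,  if $0 < c < \frac{d e^p (p - 1)}{k b^p}$.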
ESTIMATION IN THE MULTIPARAMETER GAMMA DISTRIBUTION 211 
This completes the proof of Theorem 2. □


REMARK 1. The bounds following (2.9) have been rather crude; for the
special case $\alpha_i = \alpha$, $c_i = 1$, and $m_i = m$, (2.9) gives

(2.11)  $\Delta(\theta) = \frac{cp}{(\alpha + 1)^2} \Big[ c\, \frac{\Gamma(\alpha + m + 2 - m/p)}{\Gamma(\alpha - m/p)}
        \Big( \frac{\Gamma(\alpha - m/p)}{\Gamma(\alpha)} \Big)^p \prod_{j=1}^p \theta_j^{m/p}
        - \frac{|m|(p - 1)}{p^2}\, \frac{\Gamma(\alpha + 1 + m/2 - m/2p)}{\Gamma(\alpha - m/2p)}
        \Big( \frac{\Gamma(\alpha - m/2p)}{\Gamma(\alpha)} \Big)^p \sum_{i=1}^p \theta_i^{m/2} \prod_{j=1}^p \theta_j^{m/2p} \Big]$

        $< 0$

if

$c < \frac{|m|(p - 1)}{p} \cdot \frac{\Gamma(\alpha + 1 + m/2 - m/2p)\, (\Gamma(\alpha - m/2p))^{p - 1}}{\Gamma(\alpha + m + 2 - m/p)\, (\Gamma(\alpha - m/p))^{p - 1}}$.

Further specializing to the case $m = -2$ (ordinary squared-error loss), the range
of $c$ is $0 < c < \frac{2(p - 1)}{p} (\Gamma(\alpha + 1/p)/\Gamma(\alpha + 2/p))^p$. It is clear that there is
no unique value of $c$ which maximizes the improvement in risk. However,
from (2.11) it is clear that the upper bound on $\Delta(\theta)$ is minimized at
$c = \frac{p - 1}{p} (\Gamma(\alpha + 1/p)/\Gamma(\alpha + 2/p))^p$. Thus there is no optimal choice of
$c$ but if one has to choose one, the midpoint of the allowable values may be a
natural choice. Recent results in Berger and Das Gupta (1985), however, indicate
that the upper bound of $c$ may be the most appropriate choice in many restricted
risk Bayes problems.
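
The risk difference in Theorem 2 and the admissible range of $c$ can be evaluated exactly from gamma-function moments, which makes a small numerical check possible. The sketch below is illustrative and not taken from the paper: it assumes the density $\theta^{\alpha} x^{\alpha - 1} e^{-\theta x}/\Gamma(\alpha)$ for each $X_i$, the estimator $\delta_i(x) = x_i(1 + \phi_i(x))/(\alpha_i + 1)$ with $\phi_i(x) = -c\,(\mathrm{sgn}\, m_i)\, x_i^{m_i/2} \prod_j x_j^{-m_j/2p}$, and the bound $c < d e^p (p - 1)/(k b^p)$ with $d = \min c_i |m_i| d_i/(\alpha_i + 1)^2$; the function names and the parameter values are choices made here.

```python
import math
import random

def gamma_moment(alpha, theta, s):
    """E[X^s] for X with density theta^alpha x^(alpha-1) exp(-theta x) / Gamma(alpha)."""
    return math.exp(math.lgamma(alpha + s) - math.lgamma(alpha)) / theta ** s

def risk_difference(c, alphas, thetas, ms, cs):
    """Exact R(theta, delta) - R(theta, delta_0), term by term as in (2.7)."""
    p = len(alphas)

    def expect(i, power_i, scale):
        # E[x_i^power_i * prod_j x_j^(-scale * m_j / (2p))], using independence
        out = 1.0
        for j in range(p):
            s = -scale * ms[j] / (2.0 * p) + (power_i if j == i else 0.0)
            out *= gamma_moment(alphas[j], thetas[j], s)
        return out

    total = 0.0
    for i in range(p):
        sign = math.copysign(1.0, ms[i])
        e_x2_phi2 = c ** 2 * expect(i, 2.0 + ms[i], 2.0)
        e_x2_phi = -c * sign * expect(i, 2.0 + ms[i] / 2.0, 1.0)
        e_x_phi = -c * sign * expect(i, 1.0 + ms[i] / 2.0, 1.0)
        a1, th = alphas[i] + 1.0, thetas[i]
        total += cs[i] * th ** ms[i] * (
            th ** 2 * (e_x2_phi2 + 2.0 * e_x2_phi) / a1 ** 2 - 2.0 * th * e_x_phi / a1)
    return total

def c_upper(alphas, ms, cs):
    """Upper end of the range of c in (2.10): d e^p (p - 1) / (k b^p)."""
    p = len(alphas)
    g = math.gamma
    a = [g(al + m + 2 - m / p) / g(al - m / p) for al, m in zip(alphas, ms)]
    b = [g(al - m / p) / g(al) for al, m in zip(alphas, ms)]
    d = [g(al + 1 + m / 2 - m / (2 * p)) / g(al - m / (2 * p)) for al, m in zip(alphas, ms)]
    e = [g(al - m / (2 * p)) / g(al) for al, m in zip(alphas, ms)]
    k = sum(ci * ai / (al + 1) ** 2 for ci, ai, al in zip(cs, a, alphas))
    d_min = min(ci * abs(m) * di / (al + 1) ** 2 for ci, m, di, al in zip(cs, ms, d, alphas))
    return d_min * min(e) ** p * (p - 1) / (k * max(b) ** p)

# Hypothetical unequal case.
alphas, ms, cs = [2.0, 3.0, 2.5], [-2.0, 1.0, -1.0], [1.0, 2.0, 0.5]
c_mid = 0.5 * c_upper(alphas, ms, cs)
rng = random.Random(0)
diffs = [risk_difference(c_mid, alphas, [rng.uniform(0.2, 5.0) for _ in alphas], ms, cs)
         for _ in range(200)]

# Equal case with m = -2: compare with the closed-form range in Remark 1.
p_eq, alpha_eq = 3, 2.0
bound_eq = c_upper([alpha_eq] * p_eq, [-2.0] * p_eq, [1.0] * p_eq)
closed_form = 2.0 * (p_eq - 1) / p_eq * (
    math.gamma(alpha_eq + 1.0 / p_eq) / math.gamma(alpha_eq + 2.0 / p_eq)) ** p_eq
print(max(diffs), bound_eq, closed_form)
```

With $c$ at the midpoint of the allowable range, every sampled $\theta$ gives a negative risk difference, and in the equal-parameter squared-error case the general bound reduces to the closed form quoted in this remark.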


REMARK 2. For the squared-error loss ($m_i = -2$), the improved estimate
shifts by a multiple of the geometric mean. In a recent paper, Das Gupta and
Sinha (1984) have shown that for estimating $\sum_{i=1}^p l_i\theta_i^{-1}$ under squared-error loss,
$\sum_{i=1}^p l_i X_i/(\alpha_i + 1)$ is inadmissible and the improved estimate also shifts by a
multiple of the geometric mean.


REMARK 3. The shrinkage behavior of our improved estimators follows the
same pattern as the improved estimators in Berger (1980a); thus if $m_i < 0$, the
improved estimate expands the natural estimate of that coordinate and if $m_i > 0$,
the improved estimate shifts by a negative quantity.


REMARK 4. For the invariant quadratic loss $\sum_{i=1}^p (1 - \delta_i\theta_i)^2$, the estimate
$\delta(x)$ with $\delta_i(x) = x_i/(\alpha + 1) + [c/(\alpha + 1)]\,(\mathrm{sgn} \log x_i)\, x_i \prod_{j=1}^p |\log x_j|^{-1/p}$ is
suggested as the alternative estimator by the solutions to $\Delta(x) < 0$ in Theorem 1.
For such an estimate, an analytical representation of the risk is difficult to
obtain; hence the natural way to prove that it dominates the standard estimate
would be by using the technique of solving an exact differential inequality (not
just the dominant terms); as is well known, certain tail and integrability
restrictions must be imposed on the solutions for inadmissibility to be proved. These
integrability conditions are not met by the estimators described above and hence
the invariant loss has to be left out. We remark that for the invariant loss the
improved estimates obtained in Berger (1980a) are probably the most natural
because on making a log transform, they resemble the James-Stein estimators.


REMARK 5. Finally note that the improved estimates of Theorem 2 allow for
smaller values of the shape parameters $\alpha_i$ to be accommodated than could
previously be done. For example, if all $m_i = 1$, then Theorem 2 only requires
$\alpha_i > 1/p$ for every $i$, while Berger (1980a) requires $\alpha_i > 4$ for every $i$ if the $\alpha_i$'s are
possibly unequal. Note that for large $p$, $\alpha_i > 1/p$ will probably be satisfied
anyway.


We now state a general inadmissibility theorem for estimating the natural
parameters. The proof will be omitted because of similarity to the proof of
Theorem 2.


THEOREM 3. Consider the problem of estimating the natural parameters
$(\theta_1, \theta_2, \ldots, \theta_p)$ of $p$ independent gamma distributions under a loss $L(\theta, \delta) =
\sum_{i=1}^p c_i\theta_i^{m_i}(\delta_i/\theta_i - 1)^2$, where $c_i > 0$, $m_i \ne 0$ are some constants; assume $\alpha_i >
m_i/p$, $\alpha_i - 2 > m_i(1 - p)/p$, and $\alpha_i - 1 > m_i(1 - p)/2p$. Consider the
estimates $\delta_0(x)$ and $\delta(x)$ defined as

(2.12)  $\delta_{0,i}(x) = \frac{\alpha_i - 2}{x_i}$,
        $\delta_i(x) = \frac{\alpha_i - 2}{x_i} \Big[ 1 + c\,(\mathrm{sgn}\, m_i)\, x_i^{m_i/2} \prod_{j=1}^p x_j^{-m_j/2p} \Big]$,

where $c$ is a sufficiently small positive number.
Then $R(\theta, \delta) < R(\theta, \delta_0)$ for every $\theta$ if $p \ge 2$ and hence $\delta_0(X)$ is inadmissible.


As in the estimation of the scale parameters, the invariant loss cannot be
handled. Note that the ordinary squared-error loss now corresponds to $m_i = 2$
and the restrictions on $\alpha_i$ are automatically satisfied since $\alpha_i > 2$. Also, as before,
the improved estimates for the different losses are functionally similar. For the
special squared-error loss, each coordinate is shifted by the reciprocal of the
geometric mean. This fact brings out a natural similarity between estimating the
$\theta_i$'s and the $\theta_i^{-1}$'s in the sense that the shift by the improved estimate in one
problem is just the reciprocal of the shift in the other problem. In other words,
the reciprocal transformation on the parametric function is also exactly reflected
in the improved estimate. Finally note that so far inadmissibility of the usual
estimate of the natural parameters was known only for the squared-error loss (see
Berger (1980a)); Theorem 3 establishes inadmissibility for a wide class of weighted
quadratic losses.

We now get back to estimating the scale parameters and generalize Theorem 2
in two directions for the ordinary squared-error loss. For notational simplicity,
we have taken $\alpha_i = \alpha$ but the proofs go through with arbitrary $\alpha_i$'s.


THEOREM 4. Consider the situation in Theorem 2 with $m_i = -2$. Let $\delta(X)$
be any estimate given as

(2.13)  $\delta_i(x) = \frac{x_i}{\alpha + 1} + \frac{\alpha\, r(t)}{\alpha + 1} + \frac{t\, r'(t)}{p(\alpha + 1)}$,

where $t$ stands for $(\prod_{i=1}^p x_i)^{1/p}$ and $r(\cdot)$ is such that

(i) $0 < r(t)/t < 2(p - 1)/p(\alpha + 1)^2$,
(ii) $r(t)$ is nondecreasing, and
(iii) $r(t)/t$ is nonincreasing.

Then $\delta(X)$ dominates $X/(\alpha + 1)$.

PROOF. Define $h_i(x) = x_i^{\alpha} r(t)$. In view of Berger (1980a), it is enough to
show $h_i$ solves (2.7) there and that with this choice of $h_i$, the improved estimate
is as in the statement of Theorem 4.
First note

(2.14)  $h_i^{(1)}(x) = \alpha x_i^{\alpha - 1} r(t) + \frac{1}{p}\, x_i^{\alpha - 1}\, t\, r'(t)$.

Hence, the $i$th coordinate of the improved estimate is

$\delta_i(x) = \frac{x_i}{\alpha + 1} + \frac{x_i^{-\alpha + 1} h_i^{(1)}(x)}{\alpha + 1}
        = \frac{x_i}{\alpha + 1} + \frac{\alpha\, r(t)}{\alpha + 1} + \frac{t\, r'(t)}{p(\alpha + 1)}$,

which is of the form (2.13).
Next, (2.7) in Berger (1980a) is equivalent to

$\Delta_{-2}(x) = -\frac{2 r(t)}{\alpha + 1} \sum_{i=1}^p x_i
        + \frac{2}{(\alpha + 1)^2} \Big[ \alpha r(t) + \frac{t r'(t)}{p} \Big] \sum_{i=1}^p x_i
        + \frac{p}{(\alpha + 1)^2} \Big[ \alpha r(t) + \frac{t r'(t)}{p} \Big]^2$

$\le \frac{p(\alpha + 1/p)^2}{(\alpha + 1)^2}\, r^2(t) - \frac{2(p - 1)}{p(\alpha + 1)^2}\, r(t) \sum_{i=1}^p x_i$
        $\Big[ \text{since } 0 \le \frac{t r'(t)}{r(t)} \le 1 \Big]$

$\le \frac{p(\alpha + 1/p)^2}{(\alpha + 1)^2}\, r^2(t) - \frac{2(p - 1)}{(\alpha + 1)^2}\, t\, r(t)$

$< 0$. □


REMARK 1. The upper bound on $r(t)/t$ can be increased to $2(p - 1)/
p(\alpha + 1/p)^2$. However, for large $p$, this is likely to be immaterial.


REMARK 2. With $r(t) = ct$ ($0 < c < 2(p - 1)/p(\alpha + 1)^2$), one gets the
improved estimates of Theorem 2. One advantage of providing an extended class of
improved estimators as in Theorem 4 is that there is more scope for incorporating
prior information in choosing an alternative estimator (see Berger (1982)) and
also that in this extended class one may actually find an alternative estimator
which is admissible. In the normal problem, for example, the admissible minimax
estimators were found from such an enlarged class of improved estimators (see
Strawderman (1971) and Berger (1976)).


REMARK 3. With $t = (\prod_{i=1}^p x_i)^{1/2p}$, a similar extended class of improved
estimators is easy to find along the lines of Theorem 4 when $m_i = -1$.


REMARK 4. It is easy to show that for a broader class of linear estimates
$\delta_0(X) = AX$ of the scale parameters, uniform risk domination can be achieved by
shifting by the geometric mean when the loss is squared-error. We have been able
to prove that if the elements $a_{ij}$ of $A$ are such that $(\alpha_j + 1/p) \sum_{i=1}^p a_{ij} < 1\ (> 1)$
for every $j \ge 1$, then $AX$ is inadmissible and $\delta(X) = AX + c (\prod_{i=1}^p X_i)^{1/p} \cdot \mathbf{1}$ is a
better estimator for suitable constants $c$. Many such linear estimates $AX$,
however, can be uniformly dominated in risk by other linear estimates and
unfortunately, we have not been able to characterize all the admissible linear
estimates in this case. Such characterizations were obtained in the normal case by
Cohen (1966) and in the Poisson case by Brown and Farrell (1985).


3. Empirical Bayes interpretation. In the results presented so far, the 
emphasis has been on establishing inadmissibility. However, the story does not 
end in merely knowing that a particular estimator is inadmissible; in fact, if we 
may say so, the interesting problems arise exactly at this point. Perhaps the most 
important and most interesting question that needs to be answered is whether 
one can build up improved estimators which conform to one’s prior beliefs about 
the unknown parameters. A lot of research has been done on these questions for 
problems involving the normal distribution and some results are known also in
the case of the Poisson distribution; in particular, in both problems, proper Bayes
improved estimators are known for large enough $p$; also, the James-Stein (1960)
and Clevenson-Zidek (1975) estimators have long been known to have an
empirical Bayesian justification. In the gamma problem, finding Bayes or
generalized Bayes improved estimators seems difficult as the so-called "conditionally
conjugate" two-stage priors are analytically intractable. In what follows, we show
that the class of improved estimators obtained in Theorem 2 contains certain 
natural empirical Bayes estimators. 

Assume 0, ~ mdop gamma (2, r), where r~} > 0 is an unknown scale parameter. 
For squared-error loss, the Bayes estimate (given r) of 67 ' is 


5 x,+Tr 
(3.1) I ayers 
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Marginally, the x,’s are ii.d. with the joint p.d.f. 


a-l 
Tax; 


URREA + S l 


[ep hae) & 


In particular, the density of each x, is 


i Aar 
(3.2) f(x) = 7B(a,2) Urar 


Clearly, $r$ is a scale parameter for the distribution of $x_1, x_2, \ldots, x_p$. The best
scale-invariant estimate of $r$ in the restricted class of estimators which depend
only on the geometric mean is

$a(x) = c_0\,(\prod_i x_i)^{1/p},$

where

$c_0 = \dfrac{(\Gamma(\alpha + 1/p)/\Gamma(\alpha))^{p}\,(\Gamma(2 - 1/p)/\Gamma(2))^{p}}{(\Gamma(\alpha + 2/p)/\Gamma(\alpha))^{p}\,(\Gamma(2 - 2/p)/\Gamma(2))^{p}} = \dfrac{(\Gamma(\alpha + 1/p)\,\Gamma(2 - 1/p))^{p}}{(\Gamma(\alpha + 2/p)\,\Gamma(2 - 2/p))^{p}}.$

The empirical Bayes estimate is what results when $a(x)$ is substituted for $r$ in
(3.1). It can be verified that $c_0 < 2(p - 1)/p \cdot (\Gamma(\alpha + 1/p)/\Gamma(\alpha + 2/p))^{p}$ for
$p \ge 2$. Hence this empirical Bayes estimate dominates the usual estimate under
ordinary squared error loss. (See Remark 1 following Theorem 2.)
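
The inequality for $c_0$ can be checked numerically. The following is a minimal sketch, assuming the gamma-function expressions for $c_0$ and for its upper bound as written above; the function names `c0` and `c0_upper_bound` are illustrative and do not come from the paper.

```python
import math


def c0(alpha: float, p: int) -> float:
    """c_0 = (G(alpha+1/p) G(2-1/p))^p / (G(alpha+2/p) G(2-2/p))^p, via log-gamma."""
    log_num = p * (math.lgamma(alpha + 1.0 / p) + math.lgamma(2.0 - 1.0 / p))
    log_den = p * (math.lgamma(alpha + 2.0 / p) + math.lgamma(2.0 - 2.0 / p))
    return math.exp(log_num - log_den)


def c0_upper_bound(alpha: float, p: int) -> float:
    """Right-hand side 2(p-1)/p * (G(alpha+1/p) / G(alpha+2/p))^p."""
    log_ratio = p * (math.lgamma(alpha + 1.0 / p) - math.lgamma(alpha + 2.0 / p))
    return 2.0 * (p - 1) / p * math.exp(log_ratio)


if __name__ == "__main__":
    # Tabulate c_0 against its bound for a few (alpha, p) pairs.
    for alpha in (0.5, 1.0, 3.0):
        for p in (2, 3, 5, 10, 50):
            print(alpha, p, c0(alpha, p), c0_upper_bound(alpha, p))
```

The ratio of $c_0$ to the bound does not depend on $\alpha$, so the check reduces to comparing $(\Gamma(2 - 1/p)/\Gamma(2 - 2/p))^{p}$ with $2(p - 1)/p$.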

The “empirical Bayesian” interpretation of the improved estimate is some- 
what ad hoc because of the restriction to the class of estimators for r which 
depend only on the geometric mean. The restriction, however, seems necessary 
because the usual Pitman estimate of $r$ derived from the marginal distribution of
all $x_i$'s does not have an easy analytical form.

Note that for the four losses dealt with in Berger (1980a), no Bayesian 
interpretation of the improved estimators obtained there is known except that 
for $m_i = -1$, an approximation to the generalized Bayes estimate against certain
flat priors looks something like Berger’s improved estimate (see Brown and 
Hwang (1982)). It will clearly be interesting to obtain improved estimates which 
can be actually linked with reasonable priors. 





4. Risk improvement. Whereas from a theoretical standpoint uniform 
domination is of interest, of utmost importance to the applied statistician is the 
question: How much risk improvement can be achieved in practice? For most 
problems, computer simulation is necessary to get an idea of the actual risk
improvement because the improved estimators are almost always such that any 
analytical representation of the risk difference is at least formidable, if not 
impossible. Fortunately, however, our improved estimators are such that an exact 
analytical expression is possible to obtain; (2.9) and (2.10) do in fact give the risk
improvements $\Delta(\theta)$. Consequently, simulation of data will be quite unnecessary.
Since ordinary risk improvement can be unbounded, it is more meaningful to deal 
with the percentage risk improvement. 
Assume $\alpha_i = \alpha$ and $m_i = m$ ($\ne 0$). Using


o mp = DE(a + 1 + m/2 ~ m/2p)(T(a - m/2p)) 
Qpl(a+m+2-m/p)(I(a—m/p))?* 





(i.e., the midpoint of the allowed range for c), by direct calculations using (2.9), 
R(6, ôo) 7. R(8, 5) 
R(8, ôo) 
ay ste 1)(T(a +1 + m/2 — m/2p)}(T(a - m/2p))"?” 
i © Ap(at 1)T(a +m +2- m/p)(T(a - m/p))” (La)? 





y EER? — TE 8? e ore 
ae 9.” 


It may be interesting to find a measure of some kind of average percentage risk 
improvement in various parts of the parameter space. In particular, of some 
statistical interest is the limiting value (as $p \to \infty$) of the average (with respect
to Lebesgue measure) percentage risk improvement when all $\theta_i$ belong to an
interval $[a, b]$. Here, $a$ could be thought of as the prior guess for $\min_{1 \le i \le p}\theta_i$, and
$b$ as $\max_{1 \le i \le p}\theta_i$.

Specializing to the ordinary squared-error loss, from (4.1),

(4.2) $\int \dfrac{R(\theta, \delta_0) - R(\theta, \delta)}{R(\theta, \delta_0)}\,d\mu = \dfrac{(p - 1)^2\,(\Gamma(\alpha + 1/p))^{2p}}{p^2(\alpha + 1)\,(\Gamma(\alpha + 2/p))^{p}\,(\Gamma(\alpha))^{p}} \times \int \dfrac{(2/p)\sum_i \theta_i^{-1}\prod_j \theta_j^{-1/p} - \prod_j \theta_j^{-2/p}}{(1/p)\sum_i \theta_i^{-2}}\,d\mu,$

where $\mu$ stands for the normalized Lebesgue measure on $[a, b]^p$. The first term in
(4.2) monotonically increases to $1/(\alpha + 1)$ as $p \to \infty$; also, since the $\theta_i$'s are i.i.d.
uniform, by the strong law of large numbers,

(4.3) $\lim_{p \to \infty} \int \dfrac{(2/p)\sum_i \theta_i^{-1}\prod_j \theta_j^{-1/p} - \prod_j \theta_j^{-2/p}}{(1/p)\sum_i \theta_i^{-2}}\,d\mu = \lim_{p \to \infty} \dfrac{\int \left[(2/p)\sum_i \theta_i^{-1}\prod_j \theta_j^{-1/p} - \prod_j \theta_j^{-2/p}\right] d\mu}{\int \theta_1^{-2}\,d\mu} = \dfrac{2ec\log c \cdot c^{-c/(c-1)} - e^{2}c(c - 1)c^{-2c/(c-1)}}{c - 1},$


where $c = b/a$. In particular, along the line in which all $\theta_i$'s are equal to some $\theta$
(i.e., $c = 1$), by l'Hospital's rule, the expression in (4.3) is equal to 1; hence, the
percentage risk improvement along this ray can be considerable for small $\alpha$,
approaching 100% as $p \to \infty$, $\alpha \to 0$.

Interestingly, the limiting average risk improvement depends solely on the 
ratio c = b/a and decreases monotonically as c increases. We will later provide 
actual values of this limit for various c. 
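
The limit in (4.3) can be evaluated directly. The sketch below assumes the closed form $[2ec\log c \cdot c^{-c/(c-1)} - e^{2}c(c-1)c^{-2c/(c-1)}]/(c-1)$, obtained by replacing the averages in the integrand by their expectations under a uniform distribution on $[a, b]$; the function names are illustrative, and the Monte Carlo routine is only a cross-check of that reading, not part of the paper's argument.

```python
import math
import random


def limiting_ratio(c: float) -> float:
    """Closed form assumed for (4.3); depends on [a, b] only through c = b/a."""
    if abs(c - 1.0) < 1e-8:
        return 1.0  # limit as c -> 1 (l'Hospital)
    k = c / (c - 1.0)
    first = 2.0 * math.e * c * math.log(c) * c ** (-k) / (c - 1.0)
    second = math.e ** 2 * c ** (1.0 - 2.0 * k)
    return first - second


def limiting_percentage(alpha: float, c: float) -> float:
    """Limiting average percentage improvement: 100 / (alpha + 1) times the ratio."""
    return 100.0 / (alpha + 1.0) * limiting_ratio(c)


def monte_carlo_ratio(a: float, b: float, p: int, seed: int = 0) -> float:
    """Evaluate the integrand of (4.2) at one draw of p i.i.d. Uniform[a, b] thetas."""
    rng = random.Random(seed)
    thetas = [rng.uniform(a, b) for _ in range(p)]
    geo = math.exp(-sum(math.log(t) for t in thetas) / p)  # prod theta_j^(-1/p)
    mean_inv = sum(1.0 / t for t in thetas) / p
    mean_inv_sq = sum(t ** -2 for t in thetas) / p
    return (2.0 * mean_inv * geo - geo * geo) / mean_inv_sq


if __name__ == "__main__":
    for c in (1, 1.5, 2, 3, 4, 5, 10, 15, 30, 50, 100):
        print(c, round(limiting_percentage(1.0, c), 2))
```

With $\alpha = 1$, evaluating this expression at $c = 1, 1.5, 2, 3, 4, 5, 10, 15, 30, 50, 100$ gives values that agree with the entries of Table 2 to within rounding.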

Thus for m = —2, encouraging risk improvements seem to be attainable. For 
other values of m it is difficult to obtain neat expressions for the iterated 
supremum as above since (4.1) is no longer so simple to handle. For m = 1, an 
iterated supremum was calculated by Berger (1980b) for his improved estimator 
and it was found that, up to approximately 10%, risk improvement is possible (as
$\alpha \to 0$) along the same ray as we have considered. We do not have any corresponding
results for m = 1 so that a direct comparison is not possible. However,
certain numerical studies have led us to believe that for m = 1, Berger’s (1980a)
estimators will usually give better percentage risk improvement.

Since (4.3) gives an idea of attainable risk improvement only for the squared- 
error loss, it is desirable to pursue this question in other situations as well. We 
have provided some numerical observations in Tables 1 and 2. We repeat that no
computer simulation was done because it is not necessary to do it. Percentage
risk improvements are shown for different ranges of $\theta_i$'s and different m. The
improvements were calculated for a fixed set of random $\theta_i$'s uniformly distributed
in the indicated range. For all the losses, percentage risk improvements are shown
for the constant c which is the midpoint of the allowed range.
The values indicate that the percentage improvements are best for the squared-error
loss; also, the improvements seem to be better for m < 0. Finally, the
percentage improvements are larger for larger p. Throughout we have taken
independent simple exponential distributions (i.e., $\alpha_i = 1$).

TABLE 1
Percentage risk improvements

Range of θ     p =      p = 10   p = 6    p = 10   p =      p = 10
(0, 5]^p       12.26    18.25    4.55     6.33     2.18     2.77
(10, 15]^p     24.57    35.02    5.45     7.34     2.43     3.02
(0, 15]^p       7.81     9.96    3.69     5.12     1.80     2.45

TABLE 2
Limiting average percentage risk improvements (α = 1, m = −2)

c        Limiting risk improvement
1        50.00
1.5      49.33
2        48.03
3        45.16
4        42.48
5        40.10
10       31.70
15       26.60
30       18.61
50       13.74
100       8.70
ADMISSIBILITY AND MINIMAXITY RESULTS IN THE ESTIMATION
PROBLEM OF EXPONENTIAL QUANTILES


By ANDREW L. RUKHIN 


Rutgers University and Purdue University 


The estimation problem of the quantiles $\xi + b\sigma$ of an exponential distribution
with unknown location-scale parameter $(\xi, \sigma)$ is considered. We establish
the admissibility of the traditional (best equivariant) estimator for
quadratic loss when $n^{-1} \le b \le 1 + n^{-1}$, where n is the sample size. For
$b > 1 + n^{-1}$ a class of minimax procedures is found. This class contains
generalized Bayes rules and one of them is shown to be admissible within the
class of scale-equivariant procedures.


1. Introduction and summary. Let $x_1, \ldots, x_n$, $n \ge 2$, be a random sample
from an exponential distribution with unknown location-scale parameter $(\xi, \sigma)$.
We consider the statistical estimation problem of a quantile $\theta = \xi + b\sigma$. Clearly
if p, 0 < p < 1, is the order of this quantile, then $b = -\log p$. It is assumed that
the loss is quadratic, $(\delta - \theta)^2\sigma^{-2}$, where $\delta$ is an estimate of $\theta$. This loss function
is invariant under location and scale transformations.

The estimation problem of exponential quantiles is of importance in reliability 
theory, life testing, and related subjects. Many papers have been dedicated to 
practical aspects of this problem [see, for instance, Epstein and Sobel (1954), 
Epstein (1962), Ali, Umbach, and Hassanein (1981), Saleh (1981), Ali, Umbach, 
and Saleh (1982)]. 

From a theoretical point of view this problem is interesting since the best 
equivariant estimator of a quantile, which is a linear function of unknown 
location and scale parameters, is always minimax but typically inadmissible. In 
the case of a normal sample, for example, this estimator is inadmissible if $b \ne 0$
and admissible if b = 0 [see Zidek (1971) and Rukhin (1983)].

Rukhin and Strawderman (1982) established the inadmissibility of the best
equivariant estimator $\delta_0$ of an exponential quantile for which either $b > 1 + n^{-1}$
or $0 \le b < n^{-1}$, where n is the sample size. They exhibited a class of procedures
which have risk uniformly smaller than that of $\delta_0$. However all these procedures
coincide with $\delta_0$ with positive probability and none of them can be admissible. A
similar result was obtained by Rukhin and Zidek (1985) in the case of several 
independent exponential samples. 

In this paper we show that in the case $n^{-1} \le b \le 1 + n^{-1}$ the estimator $\delta_0$ is
admissible. In the case when $b > 1 + n^{-1}$ we construct a class of minimax
procedures which are different from $\delta_0$ with probability one. Some of these are
generalized Bayes rules. An explicit formula is given for a minimax generalized
Bayes estimator which is admissible within the class of all scale-equivariant
procedures. 

Heuristically the admissibility result for $\delta_0$ in the case $n^{-1} \le b \le 1 + n^{-1}$ is
due to the fact that it is generalized Bayes not only with respect to the uniform 
(right Haar) measure over the group of linear transformations of real line, but 
also with respect to many other prior distributions. Some of these distributions 
are considerably less “flat” than the uniform one, so that they can be “better 
approximated” by probability measures. The latter fact is known to be responsi- 
ble for admissibility [see Stein (1965) and Brown (1979)]. Thus our admissibility 
proof is just a slight modification of the standard one for the one-dimensional 
location parameter [see Blyth (1951), Stein (1959), and Farrell (1964)]. The 
nonuniqueness of the uniform distribution as a prior happens also in many other 
problems involving location-scale parameter. This phenomenon leads to many 
surprising admissibility results for traditional estimators of functions of $\xi$ and $\sigma$
for normal and exponential samples. 

The above-mentioned admissibility result along with the admissibility within 
the class of scale-equivariant procedures is given in Section 4. In Section 3 we 
study the form of generalized Bayes estimators and prior distributions which 
generate the estimator $\delta_0$. In Section 2 a class of minimax procedures when
$b > 1 + n^{-1}$ is constructed. The paper is concluded by a discussion of some open
problems in Section 5. 


2. A class of minimax quantile estimators. Let $x = \min_{1 \le i \le n} x_i$, $y = n^{-1}\sum x_i - x$.
Then $(x, y)$ is a version of a minimal sufficient statistic, and its
distribution has density $\sigma^{-2}p((x - \xi)/\sigma, y/\sigma)$ where

(2.1) $p(x, y) = n^{n}\exp\{-n(x + y)\}\,y^{n-2}/\Gamma(n - 1)$

for $x, y \ge 0$ and $p(x, y) = 0$ otherwise.

Any equivariant estimator $\delta$ which depends only on x and y must be of the
form $\delta(x, y) = x + cy$ for some constant c. It is easy to see that the best choice
of c, which minimizes the risk of such an estimator, is $c = a = b - n^{-1}$. This best
equivariant estimator $\delta_0(x, y) = x + ay$ is known to be minimax, but inadmissible
for $a > 1$ or $-n^{-1} \le a < 0$. Due to the structure of the exponential distribution
it is convenient to study the minimaxity for $\xi \ge 0$ separately. Thus we call
an estimator $\delta$ minimax (admissible) for $\xi \ge 0$ if it is minimax (admissible)
when the parameter space is restricted to $\{(\xi, \sigma), \xi \ge 0, \sigma > 0\}$. It was noticed by
Rukhin and Strawderman (1982) that the minimaxity of the estimator $\delta(x, y)$,
$x > 0$ for $\xi \ge 0$ implies the minimaxity of $\delta_1(x, y)$; $\delta_1(x, y) = \delta(x, y)$, $x \ge 0$;
$= \delta_0(x, y)$, $x < 0$. In this section we obtain a class of procedures which are
minimax for $\xi \ge 0$ when $a > 1$. Hence a class of minimax rules is obtained for
all $\xi$.

We study scale-equivariant estimators $\delta$ of the form

(2.2) $\delta(x, y) = x + ay - 2ay f(y/x)$, $x > 0$,

where f is a positive measurable function.
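
The claim that $c = a = b - n^{-1}$ is the best coefficient can be checked from the first two moments of $(x, y)$ under (2.1). The sketch below takes $\xi = 0$, $\sigma = 1$ and uses the independence of x and y implied by the product form of (2.1); the function names are illustrative.

```python
def risk_linear(c: float, b: float, n: int) -> float:
    """Risk E(x + c*y - b)^2 at xi = 0, sigma = 1, from the moments of (x, y).

    Under (2.1), x is exponential with rate n and y is gamma with shape n - 1
    and scale 1/n, independently of x.
    """
    mean_x, var_x = 1.0 / n, 1.0 / n ** 2
    mean_y, var_y = (n - 1.0) / n, (n - 1.0) / n ** 2
    bias = mean_x + c * mean_y - b
    return var_x + c * c * var_y + bias * bias


def best_c(b: float, n: int) -> float:
    """Coefficient a = b - 1/n of the best equivariant estimator x + a*y."""
    return b - 1.0 / n


if __name__ == "__main__":
    n, b = 5, 2.0
    a = best_c(b, n)
    # Risk at the optimum and at two nearby coefficients.
    print(a, risk_linear(a, b, n), risk_linear(a - 0.1, b, n), risk_linear(a + 0.1, b, n))
```

At the optimum the risk works out to $n^{-2} + a^{2}n^{-1}$, the normalizing quantity that reappears in the relative-improvement formula of Section 5.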
THEOREM 1. For $a > 1$ an estimator $\delta$ of the form (2.2) is minimax for $\xi \ge 0$
if for some positive p, $f(z)(1 + z^{-1})^{p}$ is a nondecreasing function of z, $z > 0$,

(2.3) $0 \le f(z)(1 + z^{-1})^{p} \le \bar f$,

and

(2.4) $a\bar f \le (a - 1)\min[1,\ p(n + 2p + 1)(n + p)^{-1}(n + p + 1)^{-1}]/(n + 1)$.

PROOF. Let $\eta = n\xi/\sigma$. Then the risk function $R(\xi, \sigma; \delta)$ of any procedure
(2.2) depends only on $\eta$, so that one can put $\sigma = 1$. For $\eta > 0$


$\Delta(\eta) = R(\eta, \delta_0) - R(\eta, \delta)$
= 4a[n?T(n - DSS foals — + ay(1 — {(y¥/x)) — bn] 
xy exp{—(x +y- 9)} ddy 


(2.5) =4a[n’T(n- D] e f(z)e" (i +z) °°? 


x |(1 + a2(1 — f(z))) i vane un du 





—(bn+7)(1 + z)” e-tu" du dz. 
n(l+z) 


Because of (2.3) 
a(n) = 4a[n?I(n — 1)]~ re ezra 42) 7°? 


x fa + a(i — fitz) f emur du 


aa +z) 
~ (bn +m) + 2) fÀ urdu dz. 
Ai 

To establish the inequality $\Delta(\eta) \ge 0$ or to prove the minimaxity of $\delta$ we use
Lemma 1 from the appendix which was used earlier in similar problems by
Baranchik (1970) and Strawderman (1974).

To apply Lemma 1 write the integral in the preceding expression as
$\int g(z)h(z)\,dz$ where $g(z) = f(z)(1 + z^{-1})^{p}$. We need first to show that h changes
sign at most once from negative to positive.

Denote 


$P_n(t) = 1 + t + \cdots + t^{n}/n! = e^{t}\int_t^{\infty} e^{-u}u^{n}\,du/n!$


and 


I(2) = h(z)(1 $ z)" Pgno +a]P (a(l te z)). 
It suffices to show that $l'(z) > 0$ if $l(z) > 0$. The latter inequality means that
(bn + 9)(1 + 2)P,(n(1 + 2))/[(n + Praia + z))] 
<1+az(1—f(ltz7')7”). 
Since for all t, $P_{n-1}(t)P_{n+1}(t) \le P_n^{2}(t)$, one has $l'(z) > 0$ if
(on +n) P,(n(1 + z))/[(n + 1)Pror(nd + 2))] 
< all -į +z) "(+p + z)~*)]. 
Inequality (2.6) implies (2.7) if 
1sall-f(pt)at+27)?| 


(2.6) 
(2.7) 


or 
(2.8) af(p+i1)<a-1. 


This inequality is met because of (2.4) and Lemma 1 is applicable. Thus it
remains to be shown that

$\int_0^{\infty} h(z)\,dz \ge 0.$

Using the easily verifiable formula 

(m+ i) foetu —7)"* du = Ba + Ae e-“u"*! dudz, 
q 0 (+z) 


one sees that (2.8) means that 
(a-1)(n+p+t+ 1 fre u-*(u — q)" du 


q 


+(n+p)'f 


Petu- PTH y -7)"*? du 


—af(n+2p+ g raa — q)" P du 
7 
—(an+t+1)(n +p) f e-tu?(u — q)”*? du 2 0, 
7 
which follows from the corollary to Lemma 2 and (2.4). □

COROLLARY 1. The estimator suggested by Rukhin and Strawderman (1982)
with $f(z) = \max[0,\ a - 1 - anz^{-1}]/[(n + 1)a]$ is minimax for $\xi \ge 0$ if
$a \le 1 + n/(2(n + 1)) + n(1/4 + n^{-1}(n + 1))^{1/2}/(n + 1)$.

Indeed it is easy to check that $f(z)(1 + z^{-1})^{p}$ is increasing if $p \le an/(a - 1) = p_0$.
Also $a\bar f = (a - 1)/(n + 1)$, and
$\min[1,\ p_0(n + 2p_0 + 1)(n + p_0)^{-1}(n + p_0 + 1)^{-1}] = 1$ if $p_0^{2} - np_0 - n(n + 1) \ge 0$.
This inequality follows from the condition of Corollary 1.
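
The link between the bound on a and the quadratic inequality in $p_0$ can be confirmed numerically. This is a small sketch under the reading of Corollary 1 given above (upper bound on a, $p_0 = an/(a-1)$); the helper names are illustrative.

```python
import math


def p_zero(a: float, n: int) -> float:
    """p_0 = a n / (a - 1), the largest p keeping f(z)(1 + 1/z)^p nondecreasing."""
    return a * n / (a - 1.0)


def min_term(p: float, n: int) -> float:
    """min[1, p (n + 2p + 1) / ((n + p)(n + p + 1))] from condition (2.4)."""
    return min(1.0, p * (n + 2.0 * p + 1.0) / ((n + p) * (n + p + 1.0)))


def a_max(n: int) -> float:
    """Upper bound on a stated in Corollary 1."""
    return 1.0 + n / (2.0 * (n + 1)) + n * math.sqrt(0.25 + (n + 1.0) / n) / (n + 1)


def quadratic_condition(a: float, n: int) -> bool:
    """True when p_0^2 - n p_0 - n (n + 1) >= 0."""
    p0 = p_zero(a, n)
    return p0 * p0 - n * p0 - n * (n + 1) >= 0.0


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, a_max(n), quadratic_condition(a_max(n) - 1e-6, n))
```

Just below `a_max(n)` the quadratic condition holds and the minimum in (2.4) equals 1; just above it the condition fails.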
COROLLARY 2. Assume that for positive z, $af(z) = (a - 1)v^{m}P(v)/Q(v)$,
$v = z(1 + z)^{-1}$, for a positive m, $m \le n + 1$ and some positive functions P and
Q. The corresponding estimator (2.2) is minimax for $\xi \ge 0$ if

(2.9) $0 < p = m - \max_v v[Q'(v)/Q(v) - P'(v)/P(v)]$

and

(2.10) $P(1)/Q(1) \le 3p/[2(2n + 1)(n + 1)]$.

Indeed condition (2.9) guarantees that $v^{-p}f(z)$ is a nondecreasing function of z,
and condition (2.10) implies (2.4).
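
The step from (2.10) to (2.4) rests on an elementary inequality in p, which can be checked on a grid. The sketch assumes the forms of (2.4) and (2.10) as written above, with $0 < p \le n + 1$; the function names are illustrative.

```python
def bound_24(p: float, n: int) -> float:
    """Right-hand side of (2.4) divided by (a - 1)."""
    ratio = p * (n + 2.0 * p + 1.0) / ((n + p) * (n + p + 1.0))
    return min(1.0, ratio) / (n + 1.0)


def bound_210(p: float, n: int) -> float:
    """Right-hand side of (2.10): 3p / [2 (2n + 1)(n + 1)]."""
    return 3.0 * p / (2.0 * (2.0 * n + 1.0) * (n + 1.0))


def implication_holds(n: int, grid: int = 500) -> bool:
    """Check bound_210 <= bound_24 for p on a grid over (0, n + 1]."""
    for i in range(1, grid + 1):
        p = (n + 1.0) * i / grid
        if bound_210(p, n) > bound_24(p, n) + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(all(implication_holds(n) for n in range(2, 16)))
```

The two bounds coincide at $p = n + 1$, which is where the constant 3 in (2.10) comes from.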


3. Generalized Bayes estimators of exponential quantiles. Let $\lambda(\xi, \sigma)$
be the density of a (generalized) prior distribution over $(\xi, \sigma)$ with respect to
right Haar measure $d\xi\,d\sigma/\sigma$. Also let $\eta = n\xi/\sigma$ and $t = n\sigma^{-1}$. We shall denote
by $\lambda(\eta, t)$ the density corresponding to this transformation. The Bayes estimator
$\delta_\lambda(x, y)$ has the form

8,(x, y) = {fu + bo ]o~*p((x — £)/o, y/o) A(E, o) dë da/o 
MiG p((x — §)/o, y/o)A(E, a) dë do/a 


= f f (n + bn)eW"=*Y)*9\(m, t)t” dtdn 
y<& 


JI e75=+5 +A (q, tt" dtd. 
ysk 


Let K(u, t) = e~“{" e"A(n, t) dn. Then f= (7 — tx)eA(n, t) dn = 
— {* e"K(u, t) du, so that 


5,(x,y)=xt [rere | nk (tr, t) - emf e“K(u, t) du dt 
0 -œ 


/ Í SEK (tx, t) dt. 
0 


Of special interest to us will be prior densities of the form
(3.3) $\lambda(\eta, t) = l(\eta)t^{\alpha}$, $t > 0$.
In this case


(3.2) 


$K(u, t) = e^{-u}\int_{-\infty}^{u} e^{\eta}l(\eta)\,d\eta\; t^{\alpha} = K(u)t^{\alpha}$,
and with $z = y/x$


glx, y) =x + y freer] nk (t/2) — e=" f“ eK (u) du dt 


| [E dt. 
Thus the generalized Bayes estimator has the form (2.2) with
j= [rewere | atek (2) - (an + 1)K(¢) + e~t f* e*K(u) du! dt 
0 — 00 


[zaz f"e-eer*ex (2) a. 
0 
An alternative formula for f(z) is derived by integration by parts 
TE freee | aak (1) + atK'(t) - e~ f" eK (u) du dt 
0 


[zaz [esere at]. S 


Clearly the best equivariant estimator $\delta_0(x, y) = x + ay$ corresponds to the
choice $l \equiv 1$, $\alpha = 0$, $K(u) \equiv 1$. However there are many other prior densities for
which $\delta_0$ is the generalized Bayes procedure.

It follows from (3.2) that $\delta_\lambda(x, y) = x + ay$ if and only if


[rere bnk (te, t) — ee e"K(u, t) du! dt 
0 — 9.0 


(3.4) 


= ay[~e7O%"K (tx, t) dt 
0 
= af et" [nK (te, t) + txKi(tx, t) + tK{(tx, t)] dt. 
0 
This identity holds for all positive y if and only if
(3.5) $K(u, t) - e^{-u}\int_{-\infty}^{u} e^{s}K(s, t)\,ds = a[uK'_u + tK'_t]$.
It is easy to see that with $d = 1/a$, $K(u, t) = e^{-u}u^{d-1}t$, $u > 0$; $K(u, t) = 0$,
$u < 0$ solves (3.5) and in this case $\delta_\lambda(x, y) = x + ay$ for $x > 0$.
Moreover if $d > 1$, $\delta_0$ is generalized Bayes also with respect to the prior
distribution with density
$\lambda(\eta, t) = (d - 1)e^{-\eta}\eta^{d-2}t$, $\eta > 0$,
$\lambda(\eta, t) = 0$, $\eta < 0$.
We formulate the results as


THEOREM 2. The generalized Bayes estimator $\delta_\lambda$ of an exponential quantile
$\xi + b\sigma$ corresponding to a prior density $\lambda(\eta, t)$, $\eta = n\xi/\sigma$, $t = n/\sigma$ has the form
(3.2). If $d = 1/a > 1$ and $\lambda(\eta, t) = e^{-\eta}\eta^{d-2}t$, $\eta > 0$; $\lambda(\eta, t) = 0$, $\eta < 0$, then for
$x > 0$, $\delta_\lambda(x, y) = x + ay$.
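
That $K(u, t) = e^{-u}u^{d-1}t$ solves (3.5) can be verified numerically. The sketch below assumes (3.5) in the form $K - e^{-u}\int_{-\infty}^{u} e^{s}K\,ds = a[uK'_u + tK'_t]$ with partial derivatives in u and t; the midpoint rule and central differences are ordinary numerical devices chosen for the illustration, not part of the paper.

```python
import math


def K(u: float, t: float, a: float) -> float:
    """K(u, t) = exp(-u) u^(d-1) t for u > 0 and 0 otherwise, with d = 1/a."""
    d = 1.0 / a
    return math.exp(-u) * u ** (d - 1.0) * t if u > 0 else 0.0


def lhs_35(u: float, t: float, a: float, steps: int = 20_000) -> float:
    """K(u,t) - exp(-u) * integral of exp(s) K(s,t) ds over (-inf, u].

    K vanishes for s < 0, so the integral is taken over [0, u] (midpoint rule).
    """
    step = u / steps
    integral = 0.0
    for i in range(steps):
        s = (i + 0.5) * step
        integral += math.exp(s) * K(s, t, a)
    return K(u, t, a) - math.exp(-u) * integral * step


def rhs_35(u: float, t: float, a: float, eps: float = 1e-5) -> float:
    """a [u dK/du + t dK/dt] with central finite differences."""
    dK_du = (K(u + eps, t, a) - K(u - eps, t, a)) / (2.0 * eps)
    dK_dt = (K(u, t + eps, a) - K(u, t - eps, a)) / (2.0 * eps)
    return a * (u * dK_du + t * dK_dt)


if __name__ == "__main__":
    print(lhs_35(1.0, 1.0, 0.5), rhs_35(1.0, 1.0, 0.5))
```

For example, with $a = 1/2$ and $u = t = 1$ both sides equal $e^{-1}/2$.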


In the remainder of this section we consider the generalized prior density (3.3) 
with 
(3.6) L,(u) = Hu) = (a — 1) flere" R(s)ds, u>0, 
0 


where $R(s) = \sum_{k=0}^{n} r_k s^{k}(1 - s)^{n-k}$. Then


K(u) = we" + (a - 1) [Ee —e-“]s* R(s)\(1—s)'ds, u>0, 


w= f" e'l(t) dt, 


and the form of the generalized Bayes estimator can be derived from (3.4) and 
Lemma 3 of the appendix. We define coefficients r, so that the polynomial terms 
of degree less than n in (A.3) vanish, akr, = r,_(a(n — k) + 1), k= 1,..., n, 
To = 1, and we choose constant 1 = f? ,eK (t) dt = w + {°,,te'l(t) dt, so that 
the polynomial term of degree n vanishes as well: 


l= waa — (a—1)(w—r,)(n + a) + (a — 1)(7, + r1) 


(3.7) +(a-D(n(a-1)+1-a) D Bla t k,n- k), 


k 
m=T(n-ita fk, k=1.,n, mal. 


(=l 


Let 
Q(v) = f trte"'K(t/z) dt[oT(n + a + 1], 
0 


e=l- Ee +(a-—1)(n+ a) `’) + T Bla +k,n-— vlu 
k=0 


=] — y/w. 


THEOREM 3. Let the generalized prior density $\lambda(\eta, t)$ be of the form (3.3)
with $l(\eta)$ given by (3.6) and (3.7). Then the corresponding generalized Bayes
estimator $\delta_\lambda$ has the form (2.2) with


(3.8) Qaf(z) =(a — 1)0"* (ew — (a — 1)r, log »)/Q(v), 
o=z/( +2). 
Furthermore, $\delta_\lambda$ is minimax for $\xi \ge 0$ if
3(a — Df - (n+a+1) 3 r,Blatk,n—k+1)| = 8(2n+1)(n+ lyr, 
k=0 


and w is defined by (3.13).


PROOF. Formula (3.8) follows directly from (3.4) and Lemma 3 of the appendix,
and the minimaxity result follows from Corollary 2 to Theorem 1. Indeed


Q(v) = Í ”pn+ae-tK(t/z)dt[oT(n + a + 1], 


Q(1) = K(0) = w, 
and
Q’(1) = -aw — (n+ a + 1)K (0) 
=(n+1)w-—(a-—1)\(n+a+ 1) fist "R(s) ds 
= (n+ 1)w —-(a-1)g. 
Also 


0Q'(0)/Q(v) =[n +1 + a(1 - »)]/(1 - v) 
(3.9) —v(1 fen oy? fPerta*e4K(t) at] fence *K (2) dt, 
which is easily seen to be nondecreasing in v. Therefore 
max, [0Q’(v)/Q(v) + (a — 1)r,/(ew — (a — 1)r, log v)| 


= Q(1)/Q(1) + (a — 1)7,/(ew). 
Because of Corollary 2 the corresponding estimator is minimax for $\xi \ge 0$ if

(3.10) $0 < p = n + 1 - Q'(1)/Q(1) - (a - 1)r_n/(\varepsilon w) = (a - 1)[q - r_n/\varepsilon]/w$

and

(3.11) $\varepsilon \le 3p/[2(2n + 1)(n + 1)]$.

Inequality (3.10) means that

$r_n/q < \varepsilon$,

and inequality (3.11) means that

$\varepsilon \le 3(a - 1)(q - r_n/\varepsilon)/[2(2n + 1)(n + 1)w] = 3(a - 1)(q - r_n/\varepsilon)(1 - \varepsilon)/[2(2n + 1)(n + 1)y]$.

These two inequalities hold for some $\varepsilon$ if

$\varepsilon^{2}[2(2n + 1)(n + 1)y + 3(a - 1)q] - 3\varepsilon(a - 1)(r_n + q) + 3(a - 1)r_n \le 0$,

which happens if

(3.12) $3(a - 1)(r_n - q)^{2} \ge 8(2n + 1)(n + 1)yr_n$.
Recall that here 

q=(nta+1) È r,B(a+k,n—k+1) 


and 


y=7,(1+(a-1)(n+ a)" + T Bla +k,n-k). 
k=0 
If (3.12) is satisfied then the value of $\varepsilon = 1 - y/w$ (or w) in (3.12) generating a
minimax estimator is found from the formula

(3.13) $\varepsilon = 3(a - 1)(r_n + q)[2(2(2n + 1)(n + 1)y + 3(a - 1)q)]^{-1}$. □
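
The passage from the quadratic inequality in $\varepsilon$ to condition (3.12) is a discriminant computation, and the value in (3.13) is the vertex of the quadratic. The sketch below checks this equivalence on random inputs; it treats y, q and $r_n$ as free positive numbers rather than computing them from the prior, and all names are illustrative.

```python
import random


def quadratic_at_vertex(a: float, n: int, y: float, q: float, r: float):
    """Return (eps_star, value) for A eps^2 - B eps + C with the coefficients above.

    A = 2(2n+1)(n+1) y + 3(a-1) q,  B = 3(a-1)(r + q),  C = 3(a-1) r.
    eps_star = B / (2A) is the value given in (3.13).
    """
    big_n = (2 * n + 1) * (n + 1)
    coef_a = 2.0 * big_n * y + 3.0 * (a - 1.0) * q
    coef_b = 3.0 * (a - 1.0) * (r + q)
    coef_c = 3.0 * (a - 1.0) * r
    eps_star = coef_b / (2.0 * coef_a)
    return eps_star, coef_a * eps_star ** 2 - coef_b * eps_star + coef_c


def margin_312(a: float, n: int, y: float, q: float, r: float) -> float:
    """3(a-1)(r - q)^2 - 8(2n+1)(n+1) y r; nonnegative exactly when (3.12) holds."""
    return 3.0 * (a - 1.0) * (r - q) ** 2 - 8.0 * (2 * n + 1) * (n + 1) * y * r


def agreement_rate(trials: int = 2000, seed: int = 0) -> float:
    """Fraction of random inputs on which the two criteria agree."""
    rng = random.Random(seed)
    agree = used = 0
    for _ in range(trials):
        a = 1.0 + rng.uniform(0.1, 5.0)
        n = rng.randint(2, 10)
        y = rng.uniform(1e-4, 0.5)
        q = rng.uniform(0.1, 20.0)
        r = rng.uniform(0.1, 20.0)
        m = margin_312(a, n, y, q, r)
        if abs(m) < 1e-4:
            continue  # skip numerically ambiguous boundary cases
        used += 1
        agree += (quadratic_at_vertex(a, n, y, q, r)[1] <= 0.0) == (m >= 0.0)
    return agree / used


if __name__ == "__main__":
    print(agreement_rate())
```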


REMARK. It is easy to see that condition (3.12) is met for sufficiently small 
positive a. 


4. Admissibility results. We start here with the following result.


THEOREM 4. If $0 \le a \le 1$ then $\delta_0(x, y) = x + ay$ is an admissible estimator
of exponential quantile $\xi + b\sigma$.


PROOF. If $\delta_0$ were inadmissible then there would exist $\delta$ such that for all $\xi$
and $\sigma$

$R(\xi, \sigma; \delta) \le R(\xi, \sigma; \delta_0)$

with strict inequality for some $\xi_0, \sigma_0$. Notice that we can assume that $\xi_0 > 0$.
Indeed if $\delta_c(x, y) = \delta(x + c, y) - c$, then

$R(\xi, \sigma; \delta_c) = R(\xi + c, \sigma; \delta)$.

Thus $\delta_c$ also improves upon $\delta_0$ and is strictly better than $\delta_0$ at $\xi_0 + c$, which is
positive for large c.
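Because of the continuity of the risk functions to prove the admissibility of $\delta_0$
it suffices to find a sequence of positive densities $\lambda_m(\xi, \sigma)$, $\xi > 0$, such that

$\iint \lambda_m(\xi, \sigma)\,d\xi\,d\sigma/\sigma < \infty$

and as $m \to \infty$

$r_m = \iint [R(\xi, \sigma, \delta_0) - R(\xi, \sigma, \delta_m)]\lambda_m(\xi, \sigma)\,d\xi\,d\sigma/\sigma \to 0$,

where $\delta_m$ is the Bayes estimator with respect to $\lambda_m$.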
A straightforward calculation shows that with p(x, y) defined by (2.1) 


Tim = ff (Bol, 7) — nlx, ¥))” dedy 
x f fo-*p((x — 8)/o, ¥/0)km(£, 0) dé da/o 
=n DIP fff [ol ») ~ Sale, y) Peres 


xy" A a(n, t) didn dx dy. 

Here we used our previous notation $\eta = n\xi/\sigma$, $t = n/\sigma$. The traditional
condition would be $\lambda_m(\xi, \sigma) \to 1$ but in this case the desired limiting relation is
false. Making use of Theorem 2 we put in the case $0 < a < 1$, $\lambda_m(\eta, t) =
e^{-\eta}\eta^{d-2}t\,h_m(t)$, $\eta > 0$, $d = 1/a$, where $h_m$ are positive differentiable functions,
$\int_0^{\infty} h_m(t)t^{-1}\,dt < \infty$, $h_m(t) \to 1$ as $m \to \infty$.
One obtains from (3.2)
ô„( x, y) — x — ay =a fete" ene (t/x) deJ fear h(t) dt. 
0 0 


Thus with a generic constant C independent of m 
fore) 2 
r= C e@ Meta rey t/x a| n-2„-n-2 
n= Cf f | i. (t/x) dt} y 
-1 
x | [ee Es) a| dydx 
0 
< Cf [yr x"? dyds f Me tet Dental hy (¢/x)]"/h,(t/x) dt 
0 


$= C\int_0^{\infty} v[h_m'(v)]^{2}/h_m(v)\,dv.$

Now we can specify the choice of functions $h_m$ by putting

$h_m(v) = [1 + (\log v)^{2}/m]^{-1}.$

Then

$\int_0^{\infty} h_m(v)\,dv/v = \int_{-\infty}^{\infty}[1 + u^{2}/m]^{-1}\,du < \infty$

and

$\int_0^{\infty} v[h_m'(v)]^{2}/h_m(v)\,dv = 8m^{-1/2}\int_0^{\infty} t^{2}(1 + t^{2})^{-3}\,dt \to 0$ as $m \to \infty$.


Thus if $0 < a < 1$, $r_m \to 0$ and Theorem 4 is proven. In the case $a = 0$, $\delta_0$ is
generalized Bayes with respect to any density of the form $\lambda(\eta, t) = \lambda(t)$, and in
the case $a = 1$ with respect to any density of the form $\lambda(\eta, t) = \lambda(\eta/t)$. Slight
modifications of Blyth’s (1951) admissibility proof prove our theorem in these
cases. □
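
The two integrals attached to $h_m(v) = [1 + (\log v)^{2}/m]^{-1}$ can be checked numerically. The sketch below assumes the integrand $v[h_m'(v)]^{2}/h_m(v)$ as reconstructed above and uses the substitution $u = \log v$; the closed forms $\pi\sqrt{m}$ and $\pi/(2\sqrt{m})$ follow from standard integrals, and the function names are illustrative.

```python
import math


def h(v: float, m: float) -> float:
    """h_m(v) = [1 + (log v)^2 / m]^(-1)."""
    return 1.0 / (1.0 + math.log(v) ** 2 / m)


def mass_closed_form(m: float) -> float:
    """Integral of h_m(v) dv / v over (0, inf), i.e. of (1 + u^2/m)^(-1) du: pi sqrt(m)."""
    return math.pi * math.sqrt(m)


def energy_closed_form(m: float) -> float:
    """8 m^(-1/2) times the integral of t^2 (1 + t^2)^(-3) over (0, inf), which is pi/16."""
    return 8.0 / math.sqrt(m) * math.pi / 16.0


def energy_numeric(m: float, steps: int = 200_000) -> float:
    """Midpoint rule for the integral of v h'(v)^2 / h(v) dv after setting u = log v.

    In the variable u the integrand is (4 u^2 / m^2) (1 + u^2/m)^(-3).
    """
    half_width = 200.0 * math.sqrt(m)  # the tails beyond this are negligible
    step = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        u = -half_width + (i + 0.5) * step
        total += 4.0 * u * u / m ** 2 * (1.0 + u * u / m) ** -3
    return total * step


if __name__ == "__main__":
    for m in (1.0, 10.0, 100.0):
        print(m, h(2.0, m), mass_closed_form(m), energy_numeric(m), energy_closed_form(m))
```

The total mass grows like $\sqrt{m}$ while the second integral decays like $1/\sqrt{m}$, which is the behaviour the Blyth-type argument needs.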


THEOREM 5. For $\alpha = 1$ the estimator $\delta_\lambda$ defined by (3.8) is admissible for
$\xi \ge 0$ within the class of all scale-equivariant procedures.


PROOF. It suffices to show that there exists a sequence of positive integrable
functions $l_m(\eta)$, $\eta > 0$ such that


(4.1) me PIRC, 84) — R(n, 8m) ]n(0) dn > 0, 


where 6,, is the estimator which minimizes [2R(n, 8)l,,(n) dn. It is easy to see 
that 


6,(x, y) = sff. (n + bn) ve +2) +2] (n) dyn do 
y<zo 


[ff ortle-O+z)+] (n)dndv, z=x/y. 
1<z0 
Comparing this formula with (3.1) we notice that $\delta_m$ is the generalized Bayes
estimator corresponding to the prior density $\lambda(\eta, t) = l_m(\eta)t$. If one defines $l_m(\eta)$
by formula (3.6) for $a = 1 + m^{-1}$ then


fren) dy = (a — 1) [s= R(s) ds < 0, 


and using a calculation similar to that done in Lemma 3 and Theorem 4 one can 
prove (4.1). We do not give here details since the conclusion of Theorem 5 can 
also be obtained by extension of Zidek’s result (1973) to the case of exponential 
quantile estimation. According to this result the generalized density $l(\eta)t$ generates
an admissible (for $\xi \ge 0$) procedure within the class of all equivariant
estimators if


J PUM] dn = 0. 


In our case Ln) = (a — 1)f5°e *R(s) ds ~ (a — 1)r,/n as ņn > œ, so that the 
integral above diverges. 


5. Open problems. An interesting unsolved problem is the question of
admissibility of the estimator (3.8). Even its admissibility in the class of all
scale-equivariant procedures (2.2) for arbitrary positive $\alpha$ seems to be difficult to
establish. Indeed the relationship between Bayes rules corresponding to a prior
density (3.3) and the rules which minimize $\int R(\eta, \delta)l(\eta)\,d\eta$ for $\alpha \ne 1$ is not
evident.

Another intriguing question is the largest amount of relative improvement
$r = \Delta(\eta)/R(\eta, \delta_0)$ for minimax estimators $\delta$. Because of (2.5) for $\eta > 0$


r= saff eP) — 4 + ay(1 — f(z)) — bn] 
xy"dzdy/[n?T(n —1)(n-? + a?n™’)], 


and for a fixed value 7 = 7, the function fọ which maximizes this quantity in the 
class (2.2) has the form 


2afo(z) = a0, at+z-— eg (ba + no) ye 2+? dy 
No/z 


get I,-y(l +z) a} . 
na/z 


In this case 
r= sae" f fel z)y?* e720 +) dy dz 
zy>7 


[T(n -= 1)(n? + a?n7)]. 
For instance, when $\eta_0 = 0$, $2af_0(z) = (a - 1)\max\{0, (1 - nz)/(n + 1)\}$, and
= (a-1)T(n + 2) f a — nz) ( + 2)? dz 
0 


[[n%(n + 1)°T(n 7 1)(n~? Be a’n-")| 


= (a = 1) [1 - 2(n/(n + 1))"I/((n + D(a? + n=»). 

The choice of generalized priors leading to minimax estimators of location-scale 
exponential parameters also remains largely an open problem. It is not difficult to 
show that prior densities (3.3) produce tail-minimax estimators only if (n) ~ 
C/\n|* as jn) > œ [in which case f(%) = 0]. One may conjecture that the prior 
densities corresponding to minimax Bayes estimators of the scale parameter $\sigma$
also generate minimax estimators of quantiles $\xi + b\sigma$. However this conjecture is
false. In fact, Brewster (1974) has found a minimax estimator of $\sigma$ for quadratic
loss. This estimator coincides for positive x with the generalized Bayes estimator
for the prior density $\lambda(\xi, \sigma) = 1$, $\xi > 0$; $= 0$, $\xi < 0$. The resulting quantile
estimator has the form (2.2) with

f(z) = (a - 1)z”[a((1 +2)" - 2”) >, 2>0, 


and it is not minimax for $\xi \ge 0$.


APPENDIX 


We give here three technical lemmas needed to prove Theorem 1 and Theo- 
rem 3. 


LEMMA 1. Let X be a random variable taking values in an interval I, and let
g be a nondecreasing nonnegative function defined over I. Assume that h is a
function which changes sign at most once, from negative to positive, on I and such that
$Eh(X) \ge 0$. Then

$Eg(X)h(X) \ge 0.$


PROOF. Let $c = \sup\{x: h(x) < 0\}$. Then $(g(x) - g(c))h(x)$ is nonnegative
for all x from I and

$Eg(X)h(X) \ge g(c)Eh(X) \ge 0.$ □
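
Lemma 1 is easy to illustrate on a discrete distribution. The sketch below builds random instances satisfying the hypotheses (g nondecreasing and nonnegative, h changing sign once from negative to positive, $Eh(X) \ge 0$) and evaluates both sides of the chain $Eg(X)h(X) \ge g(c)Eh(X) \ge 0$; the construction of the instances is an assumption made only for the illustration.

```python
import random


def expectation(values, probs):
    """Expectation of a function tabulated on the support points."""
    return sum(v * p for v, p in zip(values, probs))


def random_instance(rng: random.Random, k: int = 8):
    """Return (probs, g, h, j) on k ordered support points.

    h is negative on the first j points and positive afterwards; the positive
    part is rescaled when needed so that E h(X) >= 0.
    """
    raw = [rng.random() + 0.05 for _ in range(k)]
    total = sum(raw)
    probs = [w / total for w in raw]
    g, level = [], 0.0
    for _ in range(k):
        level += rng.random()  # nonnegative increments keep g nondecreasing
        g.append(level)
    j = rng.randint(1, k - 1)
    h = [-(0.1 + rng.random()) if i < j else 0.1 + rng.random() for i in range(k)]
    neg = -sum(probs[i] * h[i] for i in range(j))
    pos = sum(probs[i] * h[i] for i in range(j, k))
    if pos < neg:
        scale = neg / pos * (1.0 + rng.random())
        h = [h[i] if i < j else h[i] * scale for i in range(k)]
    return probs, g, h, j


def lemma1_terms(probs, g, h, j):
    """Return (E g h, g(c) E h), where c is the last point with h < 0."""
    e_gh = expectation([gi * hi for gi, hi in zip(g, h)], probs)
    return e_gh, g[j - 1] * expectation(h, probs)


if __name__ == "__main__":
    rng = random.Random(0)
    print(lemma1_terms(*random_instance(rng)))
```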
LEMMA 2. For all positive p and t 
fret P(u ~—t)"*?* du 
t 


(A.1) —n(n+p+i1)(n +p)" f eu?(u — t)™™? du 


=pl(n+1)(n+p)] ot f” e-tu-??(u — t) Pt dy 
and


(A.2) (n +1+ t) f e-tuP(u = t)"*? du z f euu = tyrtP du. 
t t 


PROOF. Inequality (A.1) means that


oo _ 
f e`"(u + t) ?u™P 
0 


x[u-n(n+p+i)(nt+p)” 
—p[(n+1)(n + p)} u+ {u + t)?| du > 0. 
It is easy to check that for a fixed ¢ the function 
h(u) =u- n(n +p +1)(n +p) - pl(n+ 1)(n + p)] wu + t)” 


changes sign only once for u > 0. Since the function g(u) = uP(u + t)? is 
increasing, (A.1) will follow from Lemma 1 if we show that 


f e-ur|u -n(n+p+1{n+p)` 
0 
—p[(n+ 1)(n + p)]~*u?*Mut | du > 0. 
The latter inequality is equivalent to an evident one: 
ae ade +t) ’du<(n+1)!= [eure du. 
0 0 


Inequality (A.2) also follows from Lemma 1 with the same function g and 
h(u)=u-n-1.0 


COROLLARY. For all positive p and t 


o 


(a - 1)p[(n+p+1)(n+1)(n +p) e~4u-2P(y — t)"*?P*! du 
<(a-1(n+p+ D7 f eztu (u — t)”*P* du 
+(n +p) fPetur Mu — t)"*? du 
-(an+1+t)(n +p) fe *u*(u — t)"*? du. 


* 


LEMMA 3. Let fort 20 


K(t) = we'+(a- opes — e*]s*-3R(s)(1 — s)`> ds, 
where $R(s) = \sum_{k=0}^{n} r_k s^{k}(1 - s)^{n-k}$. Then with v = z/(1 + z), L= f? eK “(t) dt,
PE aak(2) + atK (t) —- e~t f e*K (s) as| dt 
0 — 00 


=T[(n+a+lz"(1+2z)%* 


x (aa = Dle —r,—-(a-1)r,(nt a) * 
Mane pAlb we »)]| 
k=0 


+o" (uae ~1)\(n+a)*- (a-1)(w-1,) 


(A.3) Cah ee pie ee Cee 
ile aa a) E Bla pins | 
~v"*! log o(a — 1)"r, — 0” log »(a — 1) 
X(r,_; — anr,)(n + a)? 
iesi T oAB(a ia Daa 
x [r,_,(a(n — k) +1) - ctr] 
and 


[het Re) dt 
0 
=T(nt+ati jz" (1+2)% 


xf oF og o(a = 1)r, 
n-1l 
vor = (a oo 1)r,(n + a)! = (a i 1) £ r,B(a + k, n— n) 
k=0 


+(a-1)¥ Bat hyn bry ra) r =0. 
k=0 
PROOF. Using integration by parts one obtains
H(t) = aaK(t) + atK(t) - a e°K (8) ds 
= [waa — 1]e~' + [w — R(1)](a - 1)te™! 
+ [(aa —1)(a—-1) -(a- ae] "Te" —e~*]T(s)(1 — s) ‘ds 
-—(a- yfe” — e` ]T (s) ds + a(a- it f 'e=*T(8) ds, 
0 0 
where T(s) = s*~'R(s). A straightforward calculation shows that 
[rere te He) dt 
=T(n+a+t1) 
{ (wa spara “(ney Raed 
x[RQ) - w] +2) 77 
+a(a— 1) f’T(s)(s ay ds 
+(aa—1)(a—1)(n+ a)” 
x f'T(s)[(s egy eis) 1S) de 
(a= 1)? f’T(s)[(s egy ee (ie) © Wa) de 
—(a-1)(n+ a)" 
x [Tle ra Paes 0e s)" ds). 
By using integral representation of beta function one obtains for k < n 
[era - s)" (s +z)" *-(1+2z) "| ds 
=Blath,n—k)[zt-"(1 42) — (142) "| 
and by letting k tend to n one obtains 
[err [(s ta %-(14z) "(1-8)" 'ds 


= [log(1 + z2) —logz](1 +z) %7”. 
Also for k < n, 
S a — s) (et zy et (1 + z) 7] ds 
0 


=Blat+k+1,n—k)z*-"(1+2) “7*7 
+B(a +k, n— k+ 128-142)" 4 
—~B(a + k,n — k)(1 + z) °°" 
and
[ert |(s +2) 7" -—(1+ 2 eG —s) ds 
0 


= {log(1 + z) — log z + [(n+a)z]“*}(14+2)°7 7". 
Using these formulas we deduce that 
[ee HG) dt 
{ 


) 
=I(n+a+1) 


x (waa -In + a) (1+2) 7" + (a-1)(r,- ww) +2) 77 


+a(a-1) x r,B(a + k,n — k+ 1)zt (1 42) °* 
k=0 
+(aa—1)(a—1)(n+ a) 


x (oa +z) — logz)(1 +z) “7 
+ p3 E ERE k)(z*-(1 = zy" - (1+ a) 
k=0 


— (a— Dll +2z)—-logz+ ((n+ a)z)~")(1 +z) "7"! 
n-l 
+ ¥7r,Blatkt+in—k)z*-"14+2z) TT 
k=0 
+Blat+k,n—k+1)2*-"-(14+2)77" 


~Blatk,n-k)(l+z) 7" 
—(a—-1)(nt+a)” 
x] (n(n +e — 1) — na 
x (log(1 + z) — logz + ((n + a — 1)z)`')(L + z)77” 
+ HG +k-1)(Blath,n—k)z*™(1 +2)" * 


k=0 
+B(a+k-1,n— k+ 1t {1 +z) Ttt 


— Bla+k-1l,n-k+1)(1 +z) 7") 
n-2
- Vin(n-k)(Blatk+1n—k-1ztr 142) 
k=O 


+ B(a + k, n— k)z* "(1 + z)7®7* 


The first formula of Lemma 3 follows now after some calculation, and the 


second one is proved analogously. It follows also that 
Q(v) = [oer exp(-t)R(t/z) dt[oT(n+a +1] 
0 


— B(a+k,n- k(l + z) 75) 





- (a — 1)r,v"*! log v 


+ fe —(a-1)(n+a)'r,-(a- DF r,B(a + k,n — k)|v"*! 
k=0 


+ (a—1) È} B(a+k,n-k+1)(r, +r i)o, rr, =0.0 
k=0 
ESTIMATION OF SURVIVAL CURVES FROM DEPENDENT
CENSORSHIP MODELS VIA A GENERALIZED 
This article presents a family of estimators of the survival function based
on right-censored observations which admit the possibility that the censoring
variables may not be independent of the true failure variables. This family is
obtained by generalizing the self-consistent property (Efron, 1967) of the
product limit estimator (Kaplan and Meier, 1958). By assuming a Dirichlet
process prior distribution of the observable random vectors, nonparametric
Bayesian estimators of the survival curve, which are also members of this
family, are derived under a special loss function. These nonparametric
Bayesian estimators generalize results of Susarla and Van Ryzin (1976), who
impose a Dirichlet process prior on the failure survival function without
considering any prior distribution of the censoring variables. Large sample
properties of this family of nonparametric Bayesian estimators are also
derived.


1. Introduction. Let $X_1^0, \ldots, X_n^0$ be independent random variables, each
sharing the same survival function $S^0(t) = P(X^0 > t)$ with the random variable
$X^0$. The variables $X_i^0$, $i = 1, \ldots, n$ represent the true failure times on n individuals
subject to right censoring. The observations consist only of independent
random vectors $(X_i, \delta_i)$, $i = 1, \ldots, n$, with the same distribution as $(X, \delta)$ where
X is an observable random variable and

(1.1) $\delta = 1$ if $X = X^0$; $\delta = 0$ if $X \ne X^0$.

The aim is to estimate $S^0$ from the data $(X_1, \delta_1), \ldots, (X_n, \delta_n)$. This type of
problem arises in many practical situations, such as cancer research, biomedical
studies of survival, and life testing, and has been treated by a number of
authors. [See Kalbfleisch and Prentice (1980) for a recent list of references.]
Kaplan and Meier (1958) suggested a product limit (PL) estimator for $S^0(\cdot)$
and showed that this estimator is in fact a maximum likelihood estimator. Their
estimator $\hat S^0_{PL}(t)$ is given by


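(1.2) $$\hat S^0_{PL}(t) = \begin{cases} 1, & t < X^*_{(1)}, \\ \prod_{j=1}^{k} (1 - d_j/n_j), & X^*_{(k)} \le t < X^*_{(k+1)},\ k = 1,\dots,m-1, \\ 0, & X^*_{(m)} \le t, \end{cases}$$

where $X^*_{(1)} < X^*_{(2)} < \cdots < X^*_{(m)}$ represent the $m$ distinct observations among
$X_1,\dots,X_n$, with

$$n_j = \sum_{i=1}^{n} I(X_i \ge X^*_{(j)}), \quad j = 1,2,\dots,m,$$

$$d_j = \sum_{i=1}^{n} I(X_i = X^*_{(j)}, \delta_i = 1), \quad j = 1,2,\dots,m,$$

and $I(\cdot)$ denotes the indicator function.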
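
The product in (1.2) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions: the function names `product_limit` and `evaluate` and the toy data are hypothetical, and the sketch evaluates the running product of $(1 - d_j/n_j)$ without imposing any separate convention beyond the largest observation.

```python
from collections import Counter

def product_limit(times, deltas):
    """Return the distinct observed times and the running product of (1 - d_j/n_j)."""
    distinct = sorted(set(times))
    deaths = Counter(x for x, d in zip(times, deltas) if d == 1)
    surv, values = 1.0, []
    for x_star in distinct:
        n_j = sum(1 for x in times if x >= x_star)  # observations still at risk at x_star
        d_j = deaths[x_star]                         # uncensored observations equal to x_star
        surv *= 1.0 - d_j / n_j
        values.append(surv)
    return distinct, values

def evaluate(distinct, values, t):
    """Step-function value at t: 1 before the first distinct time, then the latest product."""
    s = 1.0
    for x_star, v in zip(distinct, values):
        if x_star <= t:
            s = v
    return s

# Hypothetical toy data: delta = 1 marks an uncensored observation.
times = [1, 2, 2, 3, 4]
deltas = [1, 1, 0, 0, 1]
distinct, values = product_limit(times, deltas)
```

In this toy data set the tie at time 2 (one failure, one censored observation) is handled by counting both in the risk set $n_j$, which is the tie adjustment built into (1.2).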

This product limit (PL) estimator has received a great amount of attention in 
recent years. Breslow and Crowley (1974), Földes and Rejtö (1981), and Meier 
(1975) based study of the properties of PL estimators on some continuity 
assumptions and special structure of the censoring mechanism. Langberg, 
Proschan, and Quinzi (1981) show that the PL estimator is strongly consistent in 
dependent random censorship models. 

Efron (1967) established a property of the PL estimator, which he named the
self-consistency property. An estimator $\hat S^0$ is said to be self-consistent if

(1.3) $$\hat S^0(t) = \sum_{i=1}^{n} \frac{I(X_i > t)}{n} + \sum_{X_i \le t} \frac{1 - \delta_i}{n}\,\frac{\hat S^0(t)}{\hat S^0(X_i)}.$$

That is, the proportion estimated to survive past $t$ is equal to the proportion of
the subjects observed to survive past $t$ plus the sum, over all individuals censored
before $t$, of the estimated conditional probability of surviving past $t$ given
survival to the censoring time.
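
The self-consistency relation (1.3) can be checked numerically for the product limit estimator. The following Python sketch is an illustration only; the helper names `km_at` and `self_consistency_rhs` and the data are hypothetical, and ties are handled by letting failures precede censored observations at the same time point.

```python
def km_at(times, deltas, t):
    """Product limit estimate at t: product of (1 - d_j/n_j) over distinct times <= t."""
    s = 1.0
    for x_star in sorted(set(times)):
        if x_star > t:
            break
        n_j = sum(1 for x in times if x >= x_star)
        d_j = sum(1 for x, d in zip(times, deltas) if x == x_star and d == 1)
        s *= 1.0 - d_j / n_j
    return s

def self_consistency_rhs(times, deltas, t):
    """Right-hand side of (1.3) evaluated with the product limit estimator plugged in."""
    n = len(times)
    s_t = km_at(times, deltas, t)
    observed = sum(1 for x in times if x > t) / n
    # Each observation censored at x <= t contributes S(t)/S(x), divided by n.
    redistributed = sum(
        s_t / km_at(times, deltas, x)
        for x, d in zip(times, deltas) if d == 0 and x <= t
    ) / n
    return observed + redistributed

times = [1, 2, 2, 3, 4]      # hypothetical data
deltas = [1, 1, 0, 0, 1]
gaps = [abs(km_at(times, deltas, t) - self_consistency_rhs(times, deltas, t))
        for t in (0.5, 1, 2, 2.5, 3, 4)]
```

For this data set the two sides of (1.3) agree at every evaluation point up to floating-point error.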

In Section 2 we extend the definition (1.3) to a more general situation. Based
on this extension, we derive a family of estimators. Large sample properties of
these estimators are also derived, under weaker conditions than those
imposed by Breslow and Crowley (1974).

Susarla and Van Ryzin (1976) [hereafter referred to as SV (1976)] applied 
Dirichlet process priors of Ferguson (1973) to S° and obtained a nonparametric 
Bayesian estimator under a squared error loss function. They found that their 
Bayesian estimator reduces to the Kaplan-Meier PL estimator as the “prior 
sample size” tends to zero. 

In Section 3 we derive a Bayesian-type estimator from the generalized
self-consistent property. We also show that this Bayesian-type estimator can be
derived formally for a special loss function by use of Dirichlet process priors for
the distribution of the random vector $(X_i, \delta_i)$. If one puts a Dirichlet process prior
only on the uncensored observations, then the above nonparametric Bayesian
estimators reduce to the estimators of SV (1976).
We conclude with some extensions to more general cases.


2. A generalized self-consistency property. In order to unify the notation
for discrete, continuous, and mixed cases of the random variable $X^0$, we follow
the notation of Kalbfleisch and Prentice (1980, pages 8-9) and define

(2.1) $$\Lambda^0(t) = -\int_0^t \frac{DS^0(u)}{S^0(u^-)}$$

and

(2.2) $$\gamma(\Lambda^0)(t) = \lim_{r \to \infty} \prod_{k=1}^{r} \{1 - [\Lambda^0(u_k) - \Lambda^0(u_{k-1})]\},$$

where $0 = u_0 < u_1 < \cdots < u_r = t$, the limit $r \to \infty$ is taken as $u_k - u_{k-1} \to 0$,
and the integral and operator "D" in (2.1) are the Riemann-Stieltjes integral and
differential operator, respectively. From definitions (2.1) and (2.2), we obtain

$$S^0(t) = \gamma(\Lambda^0)(t) = \exp\left(\int_c^t \frac{DS^0(s)}{S^0(s)}\right) \prod_{s \le t} \left(1 - \frac{\Delta S^0(s)}{S^0(s^-)}\right),$$

where the integral $\int_c^t DF/H$ means integration over the intervals of points less
than $t$ for which $F(\cdot)$ is continuous, and $\Delta F(s) = F(s^-) - F(s^+)$.

In estimating the survival function $S^0(t)$ in the presence of censoring, various
authors such as Breslow and Crowley (1974) and Meier (1975) typically adopt, for
mathematical simplicity, one of the following two censorship models:


(M.1) Independent random censorship models: There exist independent, identically
distributed censoring random variables $C_1,\dots,C_n$ such that $X_i = \min(X_i^0, C_i)$,
$\delta_i = I(X_i^0 \le C_i)$, and $C_i$ and $X_i^0$ are independent.

(M.2) Fixed censorship models: There exist $n$ constants $c_1,\dots,c_n$ such that
$X_i = \min(X_i^0, c_i)$ and $\delta_i = I(X_i^0 \le c_i)$.


Furthermore, all authors make one of the following continuity assumptions for
$S^0$ and $S^C$ (where $S^C$ is the survival function of $C_i$):


(C.1) The functions $S^0$ and $S^C$ have no common discontinuities.
(C.2) The function $S^0$ is absolutely continuous and/or $S^C$ is absolutely continuous.


Assumption (C.1) is obviously weaker than (C.2), but it need not hold in many
practical situations of interest. As a matter of fact, the PL estimator $\hat S^0_{PL}$ defined
in (1.2) is adjusted for ties, and to the best of my knowledge, no rigorous proof of
the consistency of $\hat S^0_{PL}$ exists in the literature that omits assumption (C.1). In this
section we show how a generalized self-consistent property can be used to obtain
a family of consistent estimators of $S^0$ without making any of the assumptions
(M.1), (M.2), (C.1), or (C.2).

Let
(2.3) $$S_u(t) = P(X > t, \delta = 1),$$
(2.4) $$S_c(t) = P(X > t, \delta = 0),$$
and
(2.5) $$S(t) = P(X > t) = S_u(t) + S_c(t).$$


We now assume that the censoring mechanism satisfies

(A.1) $$\int_0^t P(X^0 > t \mid X = x, \delta = 0)\, DS_c(x^-) = \int_0^t \frac{S^0(t)}{S^0(x)}\, DS_c(x^-)$$

for $0 < x < t < \infty$.

From Lemma 2.2 and Definition 2.4 below, it is natural to require (A.1) in
order to obtain consistent estimators of $S^0$ by using the self-consistency approach.
Other conditions have been suggested by other authors from different approaches;
the detailed comparison is postponed to Remark 2.11.

It can be readily seen that (A.1) holds under the models (M.1) and (M.2).
Moreover, even if the censoring random variable $C_i$ under model (M.1) is not
independent of the failure random variable $X_i^0$, in certain cases (A.1) may still
hold. We illustrate with an example.


EXAMPLE 2.1. Let $(X_i^0, C_i)$, $i = 1,\dots,n$, be independent, identically distributed
random vectors having the bivariate exponential distribution of Marshall
and Olkin (1967) with the survival function

(2.6) $$\bar S(t_1, t_2) = P(X_i^0 > t_1, C_i > t_2) = \exp(-\lambda_1 t_1 - \lambda_2 t_2 - \lambda_{12} \max(t_1, t_2))$$

for $t_1, t_2 > 0$ and $\lambda_1, \lambda_2, \lambda_{12} > 0$. Straightforward computations show that

$$P(X_i^0 > t \mid X_i = x, \delta_i = 0) = P(X_i^0 > t \mid X_i^0 > C_i = x) = \exp(-(\lambda_1 + \lambda_{12})(t - x)) = \bar S(t, 0)/\bar S(x, 0);$$

hence (A.1) holds.
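
The conditional survival probability in Example 2.1 implies that, for a censored observation, the residual time $X_i^0 - X_i$ is exponential with rate $\lambda_1 + \lambda_{12}$. The Monte Carlo sketch below illustrates this under the standard shock construction of the Marshall-Olkin distribution ($X^0 = \min(Z_1, Z_{12})$, $C = \min(Z_2, Z_{12})$ with independent exponential shocks); the function name, parameter values, and seed are hypothetical choices made for illustration.

```python
import random

def simulate_residuals(lam1, lam2, lam12, n, seed=0):
    """Draw n Marshall-Olkin pairs and return X0 - C for the censored ones (C < X0)."""
    rng = random.Random(seed)
    residuals = []
    for _ in range(n):
        z1 = rng.expovariate(lam1)    # shock affecting the failure time only
        z2 = rng.expovariate(lam2)    # shock affecting the censoring time only
        z12 = rng.expovariate(lam12)  # common shock
        x0 = min(z1, z12)
        c = min(z2, z12)
        if c < x0:                    # censored observation: delta = 0
            residuals.append(x0 - c)
    return residuals

residuals = simulate_residuals(1.0, 2.0, 0.5, 200000, seed=1)
mean_residual = sum(residuals) / len(residuals)
frac_beyond_one = sum(1 for r in residuals if r > 1.0) / len(residuals)
```

With $\lambda_1 = 1$ and $\lambda_{12} = 0.5$, the sample mean of the residuals should be close to $1/1.5$ and the fraction exceeding 1 close to $\exp(-1.5)$, up to simulation error.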


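LEMMA 2.2. Let $S_u$ and $S_c$ be as defined above. Then

(2.7) $$S^0(t) = S_u(t) + S_c(t^{\pm}) - \int_0^t P(X^0 > t \mid X = x, \delta = 0)\, DS_c(x^-) \quad \text{for } t > 0,$$

where $t^{\pm} = t^+$ or $t^-$.

PROOF.
$$S^0(t) = E(I(X^0 > t)) = E(E(I(X^0 > t) \mid (X, \delta)))$$
$$= \int_0^\infty E(I(X^0 > t) \mid X = x, \delta = 1)\, D(1 - S_u(x^-)) + \int_0^\infty E(I(X^0 > t) \mid X = x, \delta = 0)\, D(1 - S_c(x^-))$$
$$= S_u(t) + S_c(t^{\pm}) + \int_0^t E(I(X^0 > t) \mid X = x, \delta = 0)\, D(1 - S_c(x^-))$$
$$= S_u(t) + S_c(t^{\pm}) - \int_0^t P(X^0 > t \mid X = x, \delta = 0)\, DS_c(x^-). \qquad \Box$$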


THEOREM 2.3. Assumption (A.1) holds if and only if

(2.8) $$S^0(t) = S_u(t) + S_c(t^{\pm}) - \int_0^t \frac{S^0(t)}{S^0(x)}\, DS_c(x^-) \quad \text{for } t > 0,$$

where $t^{\pm} = t^+$ or $t^-$.

PROOF. Lemma 2.2 implies the result. $\Box$


Theorem 2.3 was derived by Tsai and Crowley (1985) under model (M.1),
where it served as an important equation for studying large sample properties of PL
estimators.

Now we define a generalized self-consistency property of an estimator $\hat S^0$ of $S^0$.


DEFINITION 2.4. An estimator $\hat S^0$ of $S^0$ is said to have a generalized
self-consistency property if and only if there exist consistent estimators $\hat S_u$ and $\hat S_c$,
respectively, of $S_u$ and $S_c$, such that

(2.9) $$\hat S^0(t) = \hat S_u(t) + \hat S_c(t) - \int_0^t \frac{\hat S^0(t)}{\hat S^0(x)}\, D\hat S_c(x^-) \quad \text{for } t \ge 0.$$

Let $\hat S_u(t)$ and $\hat S_c(t)$ be the empirical subsurvival functions of $S_u$ and $S_c$, respectively,
so that $\hat S_u(t) = S_u^e(t) = n^{-1} \sum_{i=1}^{n} I(X_i > t, \delta_i = 1)$ and
$\hat S_c(t) = S_c^e(t) = n^{-1} \sum_{i=1}^{n} I(X_i > t, \delta_i = 0)$. Then (2.9) reduces to (1.3), which is the definition
of the self-consistency property of $\hat S^0$ given by Efron (1967).


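THEOREM 2.5. Let $T = \sup\{t \mid S(t) > 0\}$. Then the unique solution of (2.7) or
(2.8) for $t < T$ has the following explicit expression:

(2.10) $$S^0(t) = \gamma\left(-\int_0^t \frac{DS_u(x)}{S(x^-)}\right) = \exp\left(\int_c^t \frac{DS_u(x)}{S(x)}\right) \prod_{x \le t} \left(1 - \frac{\Delta S_u(x)}{S(x^-)}\right).$$

PROOF. From (2.8), we obtain

$$0 = -DS^0(t) + DS_u(t) + DS_c(t^-) - (DS^0(t)) \int_0^t \frac{DS_c(x^-)}{S^0(x)} - \frac{S^0(t)}{S^0(t)}\, DS_c(t^-)$$
$$= -DS^0(t) \left\{1 + \int_0^t \frac{DS_c(x^-)}{S^0(x)}\right\} + DS_u(t).$$

For $t < T$, $S(t^-) \ne 0$ implies $S^0(t) \ne 0$; therefore we have

$$\frac{DS^0(t)}{S^0(t)} \left\{S^0(t) + \int_0^t \frac{S^0(t)}{S^0(x)}\, DS_c(x^-)\right\} = DS_u(t).$$

Thus, by using (2.8),

$$\frac{DS^0(t)}{S^0(t^-)} = \frac{DS_u(t)}{S_u(t^-) + S_c(t^-)}.$$

Hence

$$\Lambda^0(t) = -\int_0^t \frac{DS_u(x)}{S(x^-)},$$

or

$$S^0(t) = \gamma(\Lambda^0)(t) = \gamma\left(-\int_0^t \frac{DS_u(x)}{S(x^-)}\right). \qquad \Box$$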

The following two corollaries are direct results of Theorems 2.3 and 2.5.


COROLLARY 2.6. Assumption (A.1) holds if and only if the subsurvival functions
$S_u(\cdot)$, $S_c(\cdot)$ determine the survival function $S^0(t)$ according to expression (2.10).


COROLLARY 2.7. $\hat S^0(t)$ is a generalized self-consistent estimator if and only if
there exist consistent estimators $\hat S_u(\cdot)$ and $\hat S_c(\cdot)$, respectively, of $S_u(\cdot)$ and $S_c(\cdot)$
such that

$$\hat S^0(t) = \gamma\left(-\int_0^t \frac{D\hat S_u(x)}{\hat S_u(x^-) + \hat S_c(x^-)}\right).$$

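Let $\psi$ be a family of estimators $(\hat S_u(\cdot), \hat S_c(\cdot))$ of $(S_u(\cdot), S_c(\cdot))$ such that

(A.2) $$\sup_{0 \le t < T} |\hat S_u(t) - S_u(t)| = O\left(\sqrt{\frac{\log\log n}{n}}\right) \text{ a.e.}, \qquad \sup_{0 \le t < T} |\hat S_c(t) - S_c(t)| = O\left(\sqrt{\frac{\log\log n}{n}}\right) \text{ a.e.},$$

and

(A.3) $\sqrt{n}\,((\hat S_u, \hat S_c) - (S_u, S_c))$ converges weakly to a bivariate Gaussian process
$(X, Y)$ which has mean $(0, 0)$ and a covariance structure given for $s \le t < T$
by
$$\operatorname{Cov}(X(s), X(t)) = [1 - S_u(s)]\,S_u(t),$$
$$\operatorname{Cov}(X(s), Y(t)) = -S_u(s)\,S_c(t),$$
$$\operatorname{Cov}(Y(s), X(t)) = -S_c(s)\,S_u(t),$$
$$\operatorname{Cov}(Y(s), Y(t)) = [1 - S_c(s)]\,S_c(t).$$

Define a family $\psi^0$ of estimators of $S^0$ by

$$\psi^0 = \left\{\hat S^0 : \hat S^0(t) = \gamma\left(-\int_0^t \frac{D\hat S_u(x)}{\hat S_u(x^-) + \hat S_c(x^-)}\right),\ (\hat S_u, \hat S_c) \in \psi\right\}.$$

By slightly modifying Theorem 4.4 of Tsai and Crowley (1985), the following
large sample properties of $\hat S^0 \in \psi^0$ are established.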


THEOREM 2.8. If (A.1) holds, then the following two properties hold for
every $\hat S^0 \in \psi^0$:

(i) $$\sup_{0 \le t \le T^* < T} |\hat S^0(t) - S^0(t)| = O\left(\sqrt{\frac{\log\log n}{n}}\right) \text{ a.e. as } n \to \infty,$$

(ii) $\sqrt{n}\,(\hat S^0 - S^0)$ converges weakly to a Gaussian process $Z$ with mean 0 and
$$\operatorname{Cov}(Z(s), Z(t)) = -S^0(s)\,S^0(t) \int_0^s S^{-2}(x^-)\, DS_u(x) \quad \text{for } s \le t \le T^* < T.$$

PROOF. The only two properties of $S_u^e$ and $S_c^e$ used in Theorem 4.4 of
Tsai and Crowley (1985) are properties (A.2) and (A.3); therefore, on replacing
$(S_u^e, S_c^e)$ by $(\hat S_u, \hat S_c) \in \psi$, the whole proof carries through in the same
way. $\Box$


REMARK 2.9. Under (M.1) and (C.1), Peterson (1977) derived a unique expression
for $S^0$ in terms of $S_u$ and $S_c$ which is equivalent to formula (2.10). In
addition, formula (2.11) is a well-known result under (M.1) and (C.2) (Breslow and
Crowley, 1974).


REMARK 2.10. For the special case in which $\hat S_u = S_u^e$ and $\hat S_c = S_c^e$, Efron
(1967) derives the result of Theorem 2.5 by mathematical induction.


REMARK 2.11. In the literature, there are quite a few mathematical formulations
of what restrictions should be placed on the censoring mechanism so that
the standard methods of analysis are appropriate. Williams and Lagakos (1977)
derived constant-sum models from the likelihood function. A model for
right-censored survival data is of the constant-sum type if and only if

(A.4) $$\frac{dS_u(t)}{dS^0(t)} - \int_0^t \frac{dS_c(x)}{S^0(x)} = 1.$$

A special model of the constant-sum type is the survival-independent censoring
model introduced by Williams and Lagakos (1977), which satisfies

(A.5) $$P(X_i^0 > t \mid X_i = x, \delta_i = 0) = \frac{S^0(t)}{S^0(x)} \quad \text{for } 0 < x < t < \infty;$$

that is, censoring at time $x$ carries the same information as survival beyond time
$x$. Since (A.5) implies (A.1), (A.5) is therefore also a sufficient condition for
Theorem 2.3.

Another formulation, outlined by Cox (1975) and more formally defined by
Andersen and Gill (1982), is the model which satisfies

(A.6) $$\int_0^t \frac{dS_u(x)}{S(x^-)} = \int_0^t \frac{dS^0(x)}{S^0(x^-)};$$

that is, the failure rate of an item on test at time $t$ should be unaltered by the
censoring that has taken place. Kalbfleisch and MacKay (1979) proved that the
constant-sum model (A.4) is equivalent to (A.6). It can be readily proved that
(A.1) is also equivalent to (A.6) by following a line of proof similar to that of
Theorem 2.5.

Another formulation is due to Langberg, Proschan, and Quinzi (1981), who,
under assumption (C.1), derived a necessary and sufficient condition for Corollary
2.6 as follows:





for0 < £< œ, 


iG el 
i RD ae 
otherwise, 


for every jump point t of S,, 





and 

$P(C \ge t \mid X^0 = t) = P(C > t \mid X^0 > t)$.
They also proved the strong consistency of the PL estimator under (A.7) and
(C.1). It is conjectured that (A.7) is equivalent to (A.6) under assumption (C.1). In
certain realistic situations, assumption (C.1) may not always hold; then (A.1) or
(A.6) is preferable, and the results of Corollary 2.6 are stronger.


REMARK 2.12. Recently, Robertson and Uppuluri (1984) [hereafter referred
to as RU (1984)] generalized the PL estimator by using the idea of redistribution
of mass to the right, which was first considered by Efron (1967). Their
generalization has a strong connection with (2.7). Let
$P(X^0 > X_{(j)} \mid X = X_{(i)}, \delta = \delta_{(i)}) = W_{ij}$,
where $X_{(1)} < \cdots < X_{(n)}$ are the order statistics of $X_1,\dots,X_n$, and let $(W_{ij})_{n \times n}$
be the RR matrix defined by RU (1984, page 368). Then the estimator $\hat S^0$
obtained from procedure 1 of RU (1984, page 369) is a solution of

$$\hat S^0(t) = S_u^e(t) + S_c^e(t) - \int_0^t P(X^0 > t \mid X = x, \delta = 0)\, DS_c^e(x),$$

although their generalization is restricted to discrete estimators and cannot
be guaranteed to yield a consistent estimator of $S^0$.


3. Nonparametric Bayesian estimator of $S^0$. Let $P$ be a random probability
measure on $(\Omega, \Pi)$, where $\Omega = R^+ \times \{0,1\}$, $\Pi = B \times C$, $B$ is the $\sigma$-field of
Borel sets restricted to $R^+$, and $C = \{\emptyset, \{0\}, \{1\}, \{0,1\}\}$. Let $\alpha^*$ be a nonnull
finite measure on $(\Omega, \Pi)$. Furthermore, assume the random measure $P$ to be a
Dirichlet process on $(\Omega, \Pi)$ with parameter $\alpha^*$ and $(X_1, \delta_1),\dots,(X_n, \delta_n)$ to be a
random sample of size $n$ from this Dirichlet process $P$. Our purpose is to estimate
the survival function $S^0$ from a Bayesian point of view. [For the definition of the
Dirichlet process and some basic results, see Ferguson (1973).]

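The nonparametric Bayes estimators of $S_u(t)$ and $S_c(t)$ are, respectively,

$$\hat S_u(t) = \frac{\alpha^*((t, \infty), \{1\}) + \sum_{i=1}^{n} I(X_i > t, \delta_i = 1)}{\alpha^*(\Omega) + n}$$

and

$$\hat S_c(t) = \frac{\alpha^*((t, \infty), \{0\}) + \sum_{i=1}^{n} I(X_i > t, \delta_i = 0)}{\alpha^*(\Omega) + n},$$

under the squared error loss function

$$L(\hat S_{\cdot}, S_{\cdot}) = \int_0^\infty [\hat S_{\cdot}(u) - S_{\cdot}(u)]^2\, dw(u),$$

where $w(\cdot)$ is a weight function, $\hat S_{\cdot}(u)$ is an estimator of $S_{\cdot}(u)$, and the subscript $\cdot$ stands for $u$ or $c$.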
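Therefore, we may derive a self-consistent estimator $\hat S^0(\cdot)$ of $S^0(\cdot)$ from $\hat S_u(\cdot)$
and $\hat S_c(\cdot)$ by

(3.1) $$\hat S^0(t) = \gamma\left(-\int_0^t \frac{D\hat S_u(x)}{\hat S(x^-)}\right) = \gamma\left(-\int_0^t \frac{D\left(\alpha^*((x, \infty), \{1\}) + \sum_{i=1}^{n} I(X_i > x, \delta_i = 1)\right)}{\alpha^*([x, \infty), \{0,1\}) + \sum_{i=1}^{n} I(X_i \ge x)}\right).$$

It is easy to show that $\hat S_u(\cdot)$ and $\hat S_c(\cdot)$ satisfy conditions (A.2) and (A.3); therefore, the
following results follow.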
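
When the prior measure $\alpha^*$ is purely atomic, the product integral in (3.1) is a finite product and can be evaluated directly. The Python sketch below makes that simplifying assumption, which is not required in the text: `prior_atoms` is a hypothetical list of `(x, delta, mass)` triples with positive masses describing a discrete $\alpha^*$, and the function name is likewise illustrative.

```python
def bayes_survival(t, times, deltas, prior_atoms):
    """Evaluate (3.1) for a discrete prior measure given as (x, delta, mass) atoms."""
    # Jump points of the estimated uncensored subsurvival function.
    points = sorted({x for x, d in zip(times, deltas) if d == 1}
                    | {a for a, d, _ in prior_atoms if d == 1})
    value = 1.0
    for x in points:
        if x > t:
            break
        # Numerator: prior mass plus data count of uncensored observations at x.
        jump = (sum(m for a, d, m in prior_atoms if a == x and d == 1)
                + sum(1 for y, d in zip(times, deltas) if y == x and d == 1))
        # Denominator: prior mass on [x, inf) x {0, 1} plus number of X_i >= x.
        at_risk = (sum(m for a, _, m in prior_atoms if a >= x)
                   + sum(1 for y in times if y >= x))
        value *= 1.0 - jump / at_risk
    return value

times = [1, 2, 2, 3, 4]      # hypothetical data
deltas = [1, 1, 0, 0, 1]
prior_only = bayes_survival(1, [], [], [(1, 1, 1.0), (2, 1, 1.0)])
nearly_flat = bayes_survival(2, times, deltas, [(5.0, 1, 1e-9)])
```

The common normalizer $\alpha^*(\Omega) + n$ cancels in each ratio, so it does not appear in the code. With a vanishing prior mass the value is close to the product limit estimate for the same data.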


THEOREM 3.1. If assumption (A.1) holds, then

(i) $$\sup_{0 \le t \le T^* < T} |\hat S^0(t) - S^0(t)| = O\left(\sqrt{\frac{\log\log n}{n}}\right) \text{ a.e.},$$

(ii) $\sqrt{n}\,(\hat S^0 - S^0)$ converges weakly to a Gaussian process $Z$, as defined in
Theorem 2.8.


REMARK 3.2. It can be shown that $\hat S^0$ is the Bayesian estimator of $S^0$ under
the loss function $L(\hat S^0, S^0) = \int_0^\infty [\gamma^{-1}(\hat S^0)(t) - \gamma^{-1}(S^0)(t)]^2\, Dw(t)$, where $\gamma^{-1}$
denotes the inverse operator of $\gamma$. [For a proof, see Tsai (1983).]

REMARK 3.3. When $\alpha^*(R^+, \{0\}) = 0$ and $\alpha^*((t, \infty), \{1\}) = \alpha(t, \infty)$, then $\hat S^0(t)$
defined in (3.1) reduces to

(3.2) $$\hat S^0(t) = \frac{\alpha(t, \infty) + \sum_{i=1}^{n} I(X_i > t)}{\alpha(R^+) + n} \prod_{X^*_{(i)} \le t} \frac{\alpha(X^*_{(i)}, \infty) + n_i - d_i}{\alpha(X^*_{(i)}, \infty) + n_{i+1}},$$

where $X^*_{(i)}$, $n_i$, and $d_i$, $i = 1,\dots,m$, are defined in (1.2). If $\alpha(t, \infty)$ is a continuous
function on $R^+$ and $S_u^e(\cdot)$ and $S_c^e(\cdot)$ do not have any common discontinuities,
(3.2) will be a version of the formula derived by Susarla and Van Ryzin (1976)
with the following main differences:

(i) If $S_u^e(\cdot)$ and $S_c^e(\cdot)$ have common jump points, then formula (3.1), as well as
formula (3.2), will reduce to the $\hat S^0_{PL}$ defined in (1.2) as $\alpha^*(\Omega) \to 0$, but the
formula derived by SV (1976) will not.

(ii) It should be clear that our censoring scheme is more general than the
censoring models considered by SV (1976).

(iii) In the present paper, it is assumed that the Dirichlet process prior is given
to the probability measure of the random vector $(X_i, \delta_i)$, whereas in SV (1976)
the prior is placed on the survival function $S^0$ of $X^0$ and they do not
consider any prior on the distribution of the censoring variable.

(iv) The estimator $\hat S^0$ is derived with respect to squared error loss on $\gamma^{-1}(S^0)$,
while the result of SV (1976) is derived with respect to squared error loss on $S^0$.


4. Discussion. In this paper we have presented a unified approach to
estimating the survival function from right-censored data which combines the results
of Efron (1967), Breslow and Crowley (1974), Meier (1975), Susarla and
Van Ryzin (1976), Peterson (1977), and Langberg, Proschan, and Quinzi (1981).
We consider only "exclusive censoring," where the censored observation is of the
type $X_i^0 > X_i$. There are no conceptual difficulties in extending this method to
"inclusive censoring" problems (where the censored observations are of the type
$X_i^0 \ge X_i$), to "double censoring" problems (where the censored observations are
of the type $X_i^0 > X_i$ or $X_i^0 < X_i$), to the competing risk problem, or to other
incomplete observation problems.

In Section 1 we assume that $(X_1, \delta_1),\dots,(X_n, \delta_n)$ are independent, identically
distributed random vectors. The results of Sections 2 and 3 still hold even if this
assumption is weakened so that $(X_1, \delta_1),\dots,(X_n, \delta_n)$ are independent random
vectors, not necessarily identically distributed.

The proofs of these results proceed along the lines of the proofs given in
Sections 2 and 3, so we omit most of the details. The main step remaining
to complete the proof is to establish (2.8) with some modification.

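Let

$$S_u^i(t) = P(X_i > t, \delta_i = 1) \quad \text{for } i = 1,2,\dots,n,$$
$$S_c^i(t) = P(X_i > t, \delta_i = 0) \quad \text{for } i = 1,2,\dots,n,$$

$$\bar S_u(t) = \frac{1}{n} \sum_{i=1}^{n} S_u^i(t),$$

and

$$\bar S_c(t) = \frac{1}{n} \sum_{i=1}^{n} S_c^i(t).$$

If

$$\int_0^t P(X_i^0 > t \mid X_i = x, \delta_i = 0)\, DS_c^i(x^-) = \int_0^t \frac{S^0(t)}{S^0(x)}\, DS_c^i(x^-)$$

for $i = 1,\dots,n$, then we have

$$S^0(t) = \frac{1}{n} \sum_{i=1}^{n} P(X_i^0 > t) = \frac{1}{n} \sum_{i=1}^{n} \left\{S_u^i(t) + S_c^i(t) - \int_0^t \frac{S^0(t)}{S^0(x)}\, DS_c^i(x^-)\right\} = \bar S_u(t) + \bar S_c(t) - \int_0^t \frac{S^0(t)}{S^0(x)}\, D\bar S_c(x^-).$$

Therefore, Theorems 2.8 and 3.1 still hold as long as conditions (A.2) and (A.3)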
are satisfied when $S_u$ and $S_c$ are replaced by $\bar S_u$ and $\bar S_c$, respectively.

RANK TESTS FOR INDEPENDENCE FOR BIVARIATE
CENSORED DATA


By Dorota M. DABROWSKA 


Carnegie—Mellon University 


The paper discusses statistics that can be used to test whether two
failure times, say $X_1$ and $X_2$, are independent. The two variables are subject
to right censoring so that what is observed is $Y_i = \min(X_i, Z_i)$ and
$\delta_i = I(X_i = Y_i)$, where $(Z_1, Z_2)$ are censoring times independent of $(X_1, X_2)$.
Statistics that generalize the Spearman rank correlation and the log-rank
correlation are considered, as well as general linear rank statistics. The
Chernoff-Savage approach is adopted to show that suitably standardized
versions of these statistics are asymptotically normal under both fixed and
converging alternatives.


1. Introduction. Let $X_n = (X_{1n}, X_{2n})$ and $Z_n = (Z_{1n}, Z_{2n})$, $n = 1,\dots,N$,
be mutually independent sets of nonnegative bivariate random variables (rv)
defined on a common probability space $(\Omega, \mathcal{F}, P)$. The $X_n$'s and $Z_n$'s are
independent identically distributed (iid) rv's with continuous joint distribution
functions (cdf) $F$ and $G$, respectively, and marginal cdf's $F_1, F_2$ and $G_1, G_2$. For
each $n = 1,\dots,N$, the observable rv's are given by $Y_n = (Y_{1n}, Y_{2n})$ and
$\delta_n = (\delta_{1n}, \delta_{2n})$, where $Y_{in} = \min(X_{in}, Z_{in})$, $\delta_{in} = I(X_{in} = Y_{in})$ and $I(A)$ is the
indicator function of the set $A$. The variables $X_{1n}$ and $X_{2n}$ are thought of as survival
or failure times, and may represent lifetimes of twins or married couples, times
from initiation of a treatment until first response in two successive courses of
treatment in the same patient, etc. For each subject we observe his survival time
$X_{in}$ or censoring time $Z_{in}$, $i = 1,2$, whichever occurs first, together with an rv $\delta_{in}$
indicating if he has left the study due to death or withdrawal. Further discussion
of this type of censoring can be found in Campbell (1981, 1982), Clayton (1978),
Hanley and Parnes (1983), Langberg and Shaked (1982), and Leurgans et al.
(1982).

This paper deals with the problem of testing the hypothesis of independence of
survival times $H_0$: $F = F_1 F_2$. For uncensored data, tests for independence are
often based on rank statistics of the form $\sum_{n=1}^{N} a(R_{1n}, R_{2n})$, where $R_{1n}$ and $R_{2n}$
are ranks of $X_{1n}$'s and $X_{2n}$'s and $a(i, j)$ is a real-valued function. The scores $a$
are typically generated by some function $J(u, v)$ on the unit square by taking
expectations $a(i, j) = EJ(U_{(i)}, V_{(j)})$, where $U_{(1)} < \cdots < U_{(N)}$ and
$V_{(1)} < \cdots < V_{(N)}$ are independent ordered samples from the uniform distribution on (0, 1) [see
Shirahata (1974) and Ruymgaart (1973)].

Censored-data ranks of $X_{in}$'s can be defined as in Prentice (1978) and
Kalbfleisch and Prentice (1980). Let $N_i$ be the number of uncensored observations
among $Y_{in}$'s, $i = 1,2$, and let $R_i = (R_{i1},\dots,R_{iN})$ be given by $R_{in} = r_{in}$,
where $r_{in} = \#\{m: Y_{im} \le Y_{in}, \delta_{im} = 1\}$. Here uncensored observations are ranked
among themselves whereas censored observations are assigned the same rank as
the nearest uncensored observation on the left. For each $d = (d_1, d_2)$, $d_i = 0$ or
1, let $A_d = \{n: \delta_{1n} = d_1, \delta_{2n} = d_2\}$. The censored-data ranks of $X_{in}$'s are thought
of as the collection of all possible rankings of $X_{in}$'s that are compatible with the
observed values $R_{in} = r_{in}$ and $A_d$.
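
The rank assignment just described can be sketched in a few lines of Python. This is an illustration for one coordinate only, assuming no ties among the observations; the function name `censored_ranks` and the data are hypothetical.

```python
def censored_ranks(y, delta):
    """r_n = number of uncensored observations Y_m with Y_m <= Y_n."""
    return [sum(1 for ym, dm in zip(y, delta) if dm == 1 and ym <= yn) for yn in y]

y = [3, 1, 4, 2, 5]          # hypothetical observed times
delta = [1, 0, 1, 1, 0]      # 1 = uncensored, 0 = censored
ranks = censored_ranks(y, delta)
```

Here the uncensored values 2, 3, 4 receive ranks 1, 2, 3 among themselves, the censored value 1 (which has no uncensored observation to its left) receives rank 0, and the censored value 5 shares rank 3 with the nearest uncensored observation on its left.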

Let us assume that the joint and marginal distributions of X, = (X,,, Xon) 
have densities (pdf) f,(s, t), fo, (8), and fo {© where the parameter 8 belongs to 
an open subset O C R containing the origin and the hypothesis of independence 
is equivalent to #%: 6 = 0. Then, in the uncensored version of the experiment, 
the joint probability of the set of censored-data ranks of X,,,’8 is given by 


(1.1) (NIN) ETT [14 Fa (Unn) Fa (Venn); 0), 


where Un) < +++ < Uy, and Vg) < + < Vy,) are independent ordered sam- 
ples of sizes N, and N, “hee the a ivan distribution on (0,1) and 


als, t; 9) = fa (s)fæ (t) fo(s, t) if d= (1,1), 
= fals) f" hls, v)do ifd= (1,0), 


= fa (t) f hlu, t)du itd = (0,1), 


2 [of ta, v)dudo if d= (0,0). 


A locally most powerful rank test (LMPRT) for $H_0$: $\theta = 0$ against $H_1$: $\theta > 0$ can
be based on the score statistic from (1.1). The term "LMPRT" refers here to the
rank test that is LMPRT in the uncensored version of the experiment, given the
observed pattern of deaths and withdrawals. A straightforward calculation shows
that, under suitable regularity conditions (Hájek and Šidák, 1967, page 70), this
test is based on a statistic $\sum_{n=1}^{N} a(R_{1n}, R_{2n}, \delta_{1n}, \delta_{2n})$, where


a(i, J, di, d 2) = E4(U,), V gy d,, d,) 
(1.2) N, 


x i my,(1 — Un)” Th my,(1 — Vin) 


Here a = #{n: R,, =k, 8,,=0}, m, = #{n: R= 2} for ¿=1, 2 and 
k= 1, --.,N, and J$(u, v, d), d,) = © (Fou), Fao); 0)/ 
(F (u), Fy (o) 0), where ©,(s, £ 6) is the partial derivative of ©,(s, t 0) 
with respect to @. 
The score generating functions J often take form of a product so that the 
corresponding test is based on a linear rank statistic LY a (Rin 51,)4o( Rons on) 
where for i = 1,2 


N, 
(1.3) a (j, d) = EZ, (U, d) JI mal g Un)
and $J_i(u, d)$, $d = 0,1$, are functions satisfying

(1.4) $$\int_0^u J_i(v, 1)\, dv = -(1 - u)\, J_i(u, 0).$$

These scores were considered by Prentice (1978) and Kalbfleisch and Prentice
(1980) in the two-sample problem for time transformed location models. In the
context of testing for independence, Cuzick (1982) and Wu (1982) have derived
(1.3) as scores of the locally most powerful tests in the so-called Bhuchongkul
(1964) model.
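
Condition (1.4), read as $\int_0^u J_i(v,1)\,dv = -(1-u)J_i(u,0)$, can be checked numerically for the two score functions discussed in this section, the log-rank scores $J(u,d) = \ln(1-u) + d$ and the Spearman-type scores $J(u,d) = d - (1+d)u$. The Python sketch below uses a simple midpoint rule; the function names and the step count are hypothetical choices made for illustration.

```python
import math

def log_rank_score(u, d):
    """Log-rank (Savage) score function J(u, d) = ln(1 - u) + d."""
    return math.log(1.0 - u) + d

def spearman_score(u, d):
    """Censored-data Spearman score function J(u, d) = d - (1 + d) u."""
    return d - (1 + d) * u

def lhs(score, u, steps=20000):
    """Midpoint-rule approximation of the integral of score(v, 1) over [0, u]."""
    h = u / steps
    return h * sum(score((k + 0.5) * h, 1) for k in range(steps))

def rhs(score, u):
    """Right-hand side of the condition: -(1 - u) * score(u, 0)."""
    return -(1.0 - u) * score(u, 0)

gaps = [abs(lhs(f, u) - rhs(f, u))
        for f in (log_rank_score, spearman_score)
        for u in (0.1, 0.5, 0.9)]
```

Both score functions satisfy the identity at the evaluation points up to the quadrature error.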

With $J_i(u, d) = \ln(1 - u) + d$ we are led to the log-rank (Savage) scores
statistic

$$T_N = N^{-1} \sum_{n=1}^{N} (\hat\Lambda_1(Y_{1n}) - \delta_{1n})(\hat\Lambda_2(Y_{2n}) - \delta_{2n}),$$

where $\hat\Lambda_i$ are Nelson (1972) estimators of the marginal cumulative hazard
functions $\Lambda_i = -\ln(1 - F_i)$. The choice of $J_i(u, d) = d - (1 + d)u$ corresponds
to the censored-data version of the Spearman test. In general the exact scores
(1.3) might be hard to compute. Therefore, following Prentice (1978), Kalbfleisch
and Prentice (1980), and Cuzick (1982), we shall consider approximate scores
statistics

$$S_N = N^{-1} \sum_{n=1}^{N} J_1(\hat F_1(Y_{1n}), \delta_{1n})\, J_2(\hat F_2(Y_{2n}), \delta_{2n}),$$

where $\hat F_i$ are estimators close to the usual Kaplan-Meier (1958) estimators of the
marginal cdf's. The exact definitions of $\hat\Lambda_i$ and $\hat F_i$ are given in Section 2. In
Sections 3-5 suitably standardized versions of these statistics are shown to be
asymptotically normal under both fixed and converging alternatives. The proof of
the asymptotic normality of $T_N$ and $S_N$ follows the pattern of the Chernoff-Savage (1958)
approach to the asymptotic distribution of the two-sample linear rank statistics,
and extends results of Ruymgaart et al. (1972), Crowley (1973), and Crowley and
Thomas (1975). The results can be used to derive Pitman efficiencies of these tests
under general, not necessarily contiguous, alternatives. This problem will be
considered elsewhere.
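
A minimal Python sketch of the log-rank scores statistic follows. It assumes the textbook Nelson estimator $\hat\Lambda(t) = \sum_{m:\, Y_m \le t,\ \delta_m = 1} 1/\#\{k: Y_k \ge Y_m\}$ and the normalization by $N^{-1}$; the exact estimators used in the paper are defined in its Section 2 and may differ in detail. Function names and data are hypothetical.

```python
def nelson_estimator(y, delta):
    """Return a function t -> cumulative hazard estimate (textbook Nelson form)."""
    def cum_hazard(t):
        return sum(1.0 / sum(1 for yk in y if yk >= ym)
                   for ym, dm in zip(y, delta) if dm == 1 and ym <= t)
    return cum_hazard

def log_rank_statistic(y1, d1, y2, d2):
    """Average of products of the marginal residuals (Lambda_hat(Y) - delta)."""
    h1, h2 = nelson_estimator(y1, d1), nelson_estimator(y2, d2)
    n = len(y1)
    return sum((h1(a) - da) * (h2(b) - db)
               for a, da, b, db in zip(y1, d1, y2, d2)) / n

# Hypothetical toy data: perfectly concordant, uncensored pairs.
t_concordant = log_rank_statistic([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])

# The marginal residuals sum to zero, also with censoring present.
h = nelson_estimator([1, 2, 3], [1, 0, 1])
residual_sum = sum(h(y) - d for y, d in zip([1, 2, 3], [1, 0, 1]))
```

For the concordant toy data the residuals are $-2/3$, $-1/6$, $5/6$ in both coordinates, so the statistic equals $7/18$.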


2. Asymptotic distribution of log-rank and approximate scores statis- 
tic: assumptions and results. First let us introduce some assumptions to be 
used throughout this and subsequent sections. 


ASSUMPTION A.2.1. For each $N = 1,2,\dots$, $X_1,\dots,X_N$ and $Z_1,\dots,Z_N$ are
mutually independent sets of iid nonnegative bivariate rv's with continuous cdf's
$F_N$ and $G_N = G$ and marginal cdf's $F_{1N}, F_{2N}$ and $G_1, G_2$. For some (continuous)
cdf $F$, $F_N \to F$ as $N \to \infty$.


For each $N = 1,2,\dots$, define $L_N(s, t, d_1, d_2) = P(Y_{1n} \le s, Y_{2n} \le t, \delta_{1n} = d_1, \delta_{2n} = d_2)$,
$H_N(s, t) = P(Y_{1n} \le s, Y_{2n} \le t)$, $H_{iN}(s) = P(Y_{in} \le s)$, and
$K_{iN}(s) = 1 - P(Y_{in} > s, \delta_{in} = 1)$, $i = 1,2$. Under assumption A.2.1 these cdf's
may be easily expressed in terms of $F_N$ and $G$. Moreover, $L$, $H$, $H_i$, and $K_i$, their
limiting distributions, exist and depend on $F$ and $G$ only. Finally, let $\hat L$, $\hat H$, $\hat H_i$,
and $\hat K_i$ denote the corresponding empiricals. In terms of these empiricals we have


. $ ; AR, 
Ke) = [ie a hte) 1— Tt yey} 


where A> is the left-continuous version of A. 

The proof of the asymptotic normality of suitably standardized versions of Ty 
and S, relies on a decomposition into sums of leading terms which are asymptoti- 
cally normal, and remainder terms, which are asymptotically negligible. As 
regards the statistic Sy we assume that the score generating function J, and J, 
satisfy the following smoothness and boundedness conditions. 


ASSUMPTION A.2.2. Fori=1, 2 and d= 0,1, ¥(u,d) are continuously 
differentiable functions on [0,1) such that 


(2.1) | F(u,d)|<or(u)* and | ¥/(u,d)|< er(u)®, 

where r(u) = (1 — u)7! and c > 0, a, b, > 0 are constants satisfying 

(2.2) a, +a, <i, bta <4, a,+0,<+4. 
Further, we eliminate degenerate cases by assuming 


ASSUMPTION A2.3. K,(0) <1. 


With an abuse of notation, in what follows J {F n) PLP) denote functions 
JAF y(s), d) and I(F(s), d), respectively. For N = 1,2,..., define 


Aon = [N79 Fi) A Rw) a(L i Ly), 
Arn = [NPWin(l — Fin) Fi Fin) fo Faw) dLy, 


= [N'?Wey(1 - Fon) Si Fin) F4( Few) dLy, 
where for i = 1,2 
(2.3) Wir(s) =f (Al - Hy)r( Hw) dK y+ f'r(Hyy) d(K, — K,y). 
LEMMA 2.1. Let the assumption A.2.1 be satisfied and let J, and J, be 
functions such that A.2.2 holds with (2.2) replaced by 
(2.4) a,+a, <i, b <a +i, ba <a +1. 


Then with probability 1, N'/?X}_,A,_, is a sum of iid ro’s with mean zero and 
absolute moment of order 2 + ņ, uniformly bounded above for some 7 > 0. 
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The proof is deferred to Section 3. To standardize Ty and S, for location and 
scale define 
by = H Fy, G) = EJ (Fin(Yin), Sin) Aal Fan(Yan), Son), 


2 
on = 0 °(Fy,G) = ver 2 Any) 
k=0 


(2.5) 


Under conditions of Lemma 2.1, of, is well defined and converges to of = 
o?(F, G) = Var(X}_ Ayo), where the variance oê is evaluated under F and G, 
and the terms A,, are defined as A,y, k = 0,1,2, with Fy, H,y, K,y and Ly 
replaced by their limiting distributions. Further, with probability 1 


2 
N'*(Ty - tw) = } Ant By and 
k=0 
(2.6) 
N?(Sy - ty) = È Aan t+ Cy, 
k=0 


where By and Cy are remainder terms. 


THEOREM 2.1. Let the assumptions A.2.1 and A.2.3 be satisfied. Suppose
that $\sigma_0 > 0$ for $J_1(u, d) = J_2(u, d) = \ln(1 - u) + d$ or $J_1$ and $J_2$ satisfying
A.2.2. Then $N^{1/2}(T_N - \mu_N)$ and $N^{1/2}(S_N - \mu_N)$ converge in distribution to
$N(0, \sigma_0^2)$.


The proof of the theorem is given in subsequent sections. For uncensored data,
the Chernoff-Savage approach to linear rank statistics strongly hinges on certain
probability bounds for the empirical processes and the Brownian bridge [Pyke
and Shorack (1968) and Govindarajulu, Le Cam, and Raghavachari (1967)].
When censoring is present, it is not known if these bounds are satisfied by the
Kaplan-Meier estimator or the estimator $\hat F_i$; therefore assumption A.2.2 imposes
stronger boundedness conditions on the score functions than is necessary for
uncensored data. Note that for uncensored data $\hat F_i(s) = (N + 1)^{-1} \sum_{n=1}^{N} I(X_{in} \le s)$,
$i = 1,2$, and the conclusions of Theorem 2.1 follow from results of Ruymgaart
et al. (1972).

In general the asymptotic mean and variance of $T_N$ and $S_N$ depend on the
underlying joint distributions of both survival and censoring times. If there is no
censoring, formulas (2.5) reduce to the mean and variance given by Ruymgaart
et al. (1972). Under the null hypothesis $H_0$: $F = F_1 F_2$, if the condition (1.4) holds
then $E[J_i(F_i(Y_{in}), \delta_{in}) \mid Z_{in}] = 0$, the asymptotic null mean is equal to zero, and
$\sigma_0^2 = E J_1^2(F_1(Y_{1n}), \delta_{1n})\, J_2^2(F_2(Y_{2n}), \delta_{2n})$ [see also Cuzick (1982)]. The variance
can be further simplified by applying the following result on integration by parts
in two dimensions.


LEMMA 2.2 (Young, 1917). Let $f(s, t)$ be a function of bounded variation
such that $f(s, 0) = f(0, t) = 0$. Then for any bivariate cdf $G$ we have
$\int_0^\infty \int_0^\infty f(s, t)\, dG(s, t) = \int_0^\infty \int_0^\infty \bar G(s, t)\, df(s, t)$, where
$\bar G(s, t) = \int_s^\infty \int_t^\infty dG(u, v)$ is
the joint survival function corresponding to $G$.

Conditioning on $(Z_{1n}, Z_{2n})$ and applying (1.4) and Lemma 2.2 we obtain after some
algebra

(2.7) $$\sigma_0^2 = E J_1^{*2}(F_1(Y_{1n}))\, J_2^{*2}(F_2(Y_{2n}))\, \delta_{1n} \delta_{2n},$$

where $J_i^*(u) = J_i(u, 1) - J_i(u, 0)$. Note that (2.7) implies $\sigma_0^2(F, G) \le \sigma_0^2(F, G')$
whenever $G$ and $G'$ are cdf's such that $\bar G(s, t) \le \bar G'(s, t)$ for all $s, t \ge 0$, i.e., the
asymptotic null variance increases as the dependence between the censoring times
$Z_{1n}$ and $Z_{2n}$ increases in the sense of Lehmann's (1966) quadrant ordering.
Furthermore, $\sigma_0^2(F, G) \le \int_0^1 J_1^2(u, 1)\, du \int_0^1 J_2^2(u, 1)\, du$, so that the asymptotic null
variance is bounded by the asymptotic null variance of the corresponding rank
statistic for uncensored data [based on scores $J_i(u, 1)$].

In the case of the log-rank statistic Tys (2.7) es = ogl F, G) = Eô „ban 
so that tests based on N’”T, 67}, 627 = N~'DN_,6,,50,: are asymptotically 
distribution free [see also Cuzick (1982)]. Further 


every 2.2. Under assumptions of Theorem 2.1, if (1.4) holds then ô 625 = 
NLM FPP w(Yin) £2 Fon(Yon))Oindon is a consistent estimator of the 
eae i null variance of Sy. 


The outline of the proof is given in Sections 3 and 5. 

For most purposes it is enough to consider the null hypothesis, fixed and 
contiguous alternatives. If either only one variable is subject to censoring or the 
censoring variables are independent, exact permutation distributions are avail- 
able under the null hypothesis [see Cuzick (1982) and Wu (1982)]. Furthermore, 
as pointed out by a referee, it should be possible to derive asymptotic normality 
results under the null hypothesis and contiguous alternatives by applying a 
suitable modification of the Aalen (1978), Gill (1980), and Anderson and Gill 
(1982) martingale approach to linear rank statistics. In particular, under 3, by 
conditioning on the potential censoring times (Z,, Z.) and the scores generated 
by Y,„, and using the formulation of Mehrotra, Michalek, and Mihalko (1982), 
the conditional expectation required for the martingale property should appear 
straightforward, and similar to Cuzick (1985). 

Theorem 2.1 can be easily extended to the case of independent but noniid 
continuous survival or censoring distributions. It can be also generalized to allow 
score generating functions with a finite number of discontinuities of the first kind 
and discrete underlying distributions. Assumptions needed for these extensions 
are similar to those in Ruymgaart (1974, 1979); due to cumbersome notation we 
shall not discuss this problem in more detail. 


3. Proofs of Theorems 2.1 and 2.2: leading terms. The proof of Lemma 2.1
rests on a repeated application of inequalities 


(3.1) [IYn < 8) - Hy(s)|, 


|I(Y,, s 8) = H,y(s)|< r(Hiw(Yin)) "7(Hiy(s)) T? 
for any y € (0.1). Further, Fi, < H,, and A.2.2. imply 


(3.2) |$,(Fx(s),d)|<r(Hv(s))", — |£/(Fy(8), d)| < r(Hw(s))” 
for d = 0 or 1. Finally, we shall need Hölder’s inequality


1/p; 


(83) f\y(Hyy)a(Hay) | dB < (flax) ds) (f'e a)”, 


where #, and ġ, are functions on (0,1) and p,, p > 1 satisfy 1/p, + 1/p, = 1. 
Note that if a, + a, < } then for any n, 0 < 4n < } — a, — ay, there exist 
Po q, > 1 such that 1/p, + 1/p, = 1, 1/q, + 1/q, = 1, and 


(3.4) (a, Fit 2n)P, <1, @2P < 1, aq, < 1, 
l (ap+}+ 2n)qo <1 


[see Ruymgaart et al. (1972)}. 


PROOF OF LEMMA 2.1. We shall show that each of the terms $A_{kN}$, k = 0, 1, 2,
is a sum of iid rv’s with mean zero and absolute moment of order $2 + \eta$ uniformly
bounded above for some $\eta > 0$. By symmetry it is enough to consider the terms
$A_{0N}$ and $A_{1N}$ only.

Let M denote a generic constant, independent of N and underlying cdf’s. Set 
a = a, + a, and without loss of generality assume that (2.1) is satisfied with 
b,=a,+1. 

We have N'A on = LT Fi(Fin(Yin)s Sin) AA Fon (Yon): 82n) — Hw Which is a 
sum of iid mean zero rv’s. Applying (3.2) and (3.3) with p, = a/a, we obtain 


E\( Z (Fin (Yin), Sin) Fol Fan (Yon), Ben) f Ss M f'r(uys?*” du < œ 


provided $\eta > 0$ is chosen so that $a(2 + \eta) < 1$. This, however, can always be
achieved since $a = a_1 + a_2 < 1/2$. The upper bound does not depend on N or
the underlying cdf's.

Further, we have N'/7A,, = EN, A;n where 


n=l 


Arn = [Winall — Fin) Ff Fin) Pol Faw) dy. 


The process W,, is defined as W,y with A, and É, replaced by H,,(s) = 
I(Y,, <8) and R,(s)=1-I(Y,, > 8,8;, = 1). Applying (3.2) and (3.1) with 
y = i 4+ 9, we obtain after some algebra 


[Anl SM r(Hay(Yig)) 7 + fray" du) fr Hy) Hay) dHy. 


The 2 +7 moment of the random part on the right-hand side is finite and 
independent of N because (5; — 72+ n) < 1 for all 7 > 0. The second term is 
bounded above by 


1/P32 


1/py 
Ce en du) (uy du) Zo 
0 0 
provided 0 < 4n <  — a, — a, and p, and p, are as in (3.4). O 


PROOF OF THEOREM 2.1. The proof of the asymptotic negligibility of the 
remainder terms By and C, is given in Section 5. With an appropriate choice of 
functions ¥, and f, Lemma 2.1 and Esseen’s theorem imply that N1/*(Ty —
By)/oy and „NSn — py)/on converge in distribution to (0,1), provided 
lim inf, _, 04 > 0. Finally, applying Theorems 5.5 and 5.4 in Billingsley (1968) it 
is easy to verify that of > of as N > œ. O 


PRooF oF THEOREM 2.2. Let Lij(s,t)=P(Y%,,<5 5 Yo,<t in 
8), = 1) and let Ê, be the empirical counterpart of L,,,. Under the val 
hypothesis 


(3.5) By — 085 = (FF) IAF) (Èa - Lnn) + Dyn 


where D,, is a remainder term. The first term is an average of iid mean zero rv’s 
whereas the second term is asymptotically negligible (see Section 5). Therefore, 
by the law of large numbers, 64,5 > og, as N > œ. O 


4. Decomposition of remainder terms. Set A = A, X A, where A, = 
(0, max, Y,,], ¿= 1,2. 

For ¥, and £, satisfying assumption A.2.2, the remainder term C, in (2.6) is 
given by Cy = L}.,Cyy where 


Cin = JN A(R) - Fi Fw)) AFi) aL — An, 
Coy = [NAF Alf) — Sol Fon)) di, - Aon, 


Con = JN? 4A) -ALEN AAP) -A By) ob 


The remainder term B, in (2.6) is given by By = L3_,B,, where B,y are 
defined as C,,, with (u, d) = nà — u) + d and Ê replaced by 1 — exp(—A,). 
The terms B,, and Byy, Ciy, and C,y) are symmetric so in what follows we shall 
consider B,, and C,y only. 

For any 7 € (0,1) let A, = A,, X Az, where A,, = [0, Yer] and y,, = inf{s: 
H(s)2 1-17}, i= 1,2. Then Cin = LhayCy, and Cy = = E? Cza, where 


Cy, = Soo N'?W, (1 = Fin) Ail Fw) Fo Fon) d(L a Ly), 
Cy = > Meuse Maw =F Fin) AiE n) Fol Faw) dLy, 
Cis = fs Nn ILE) -A (Fin) F (1 = Fiy) Windi Fin) Fol Foy) dL, 


Ci, fe ae (aÊ 1) —$;(Fin)) Fo( Fy) ab = — Cy, 
C= Í NOAA) ~ Fi Fiw))( Fal P) -A Faw) ab, 


Cy. = fo NALA) -ALF n) Fo( Fy) di, 


258 D. M. DABROWSKA 


(the dependence of these terms on N is taken as understood). Here the process 
W,w is given by (2.3). Analogously, B,, = L}.,B,, and By = X$, B}, where 
B,, and Bs, are defined as C,, and C;, with ¥(u, d) = In(1 — u) + d and Ê 
replaced by 1 — exp(—A ie 

Let us recall now some properties of estimators Ê and Â, i= 1,2. Both Ê 
and Â, are right-continuous step functions with jumps at uncensored observa- 
tions. Further, for any s > 0 


(4.1) F(s)<N(N+1)'A(s), i=1,2, 


which can be verified by applying a similar argument as in Gill (1980, page 36). If 
#, satisfy smoothness conditions of A.2.2 then by the mean value theorem 

(4.2) ZAPE), d) -2(Fn(8), d) = (Ês) - By(s)) 4(,(s), d) 

for s € A, and d = 0 or 1. The function ©, is defined by ®, = Fy + 0 (Ê - F, 
where 6, = 0 (w, N, d, s) is a random function valued in (0,1). Without loss of 
generality let us assume that @, does not depend on the value of d. Further, if £ 


and #, satisfy boundedness conditions of Lemma 2.1, then by (4.1) and van 
Zuijlen (1978), for N = 1,2,... 


(4.3) up| 4(F,) (Hn) = Ox) and sup| 4/(®,) Ir( Haw)" = O1) 


uniformly in N and underlying distributions. 
Finally, let us recall consistency and weak convergence results. For i = 1,2 


(4.4) sup|Â, —- Ful >p0, — sup|A,— Anl > p0 
as N — œ [Földes and Rejtö (1978) and Gill (1980)]. The processes N'?{Â, — 
A,n) and N/?(# — Fy) converge weakly in D(A,,) to W, and (1 - F)W,, 
respectively, where W, is a mean zero Gaussian process with almost all sample 
paths continuous and covariance cov(W,(s), W,(t)) = JE r(F)?r(G,) dF, 
[Breslow and Crowley (1974) and Gill (1980)]. 

In the course of the proof of Theorems 2.1 and 2.2 we shall also use 
sup N19, y/ = op(1), sup N’? |, nv] = op{1), 
Ay A, 
sup NW. y| = Op(1) 
for Qı = Ê - Fy - (1 - Fw)Wiws Rin = A, ~ A,n ~ Win, and N sufficiently 
large. 


(4.5) 


5. Proofs of Theorems 2.1 and 2.2: asymptotic negligibility of remainder 
terms. The asymptotic negligibility of the terms By and Cy in Theorem 2.1 
will be established by a sequence of lemmas showing that B,,, Cis Baz, and Cgo 
converge in probability to 0 as r > 0 and N — oo, whereas the remaining terms 
tend in probability to 0 for any fixedr € (0,1) and N > oo. 


LEMMA 5.1. For fixed 7 € (0,1), Ci Cy, Bı and B, converge in prob- 
ability to 0 as N > œ. 
PROOF. Let us consider first the term $C_{13}$. Using (4.2) we can write it as a
sum of 


Cis. = f NQF iD) Fol Fan) dL, 
ANA, 


Gas J a NOO- Fw) Win A O) — Fi Fin) Fol Fo) di. 


Let r € (0,1) and e > 0 be fixed. There exists a constant M, = M,(1) such that 
for N large enough sup|F,y — P| < 7/3 and sup,, | %o( Fon) < M, Further, 
there exist constants M, = M(t, e, M, = M,{7, e) such that for N sufficiently 
large the sets Q, = {sup, |Ê, — Fy| < 7/3}, Ra = {supy, | F1(®,)| < Mp}, Qa = 
{sup,, N'/?|W,y| < M3}, and Q, = {A, C A} have probability at least 1 -— e. 
Then however 


4 
i N 24) Ca <s M,M,sup| NQ] 


kml lr 
and by (4.5) this bound converges in probability to 0. Also 
3 
i N R }[Cy30] < M,M,I(Q,)sup|-J/(®,) — Ai Fw) | 
kel Ir 





For d = 0,1 the function §/(u, d) is uniformly continuous on [0,1 — 7/8] so 
that |®, — Fial < Ê - Fn! and (4.4) imply that this bound tends in probability 
to 0. A similar argument combined with (4.3) and (4.5) shows that C,, converges 
in probability to 0 as N > œ. The asymptotic negligibility of B,, and B} 
follows immediately from (4.4) and (4.5). O 

LEMMA 5.2. For fixed $\tau \in (0,1)$, $B_{11} \to_P 0$ and $C_{11} \to_P 0$ as $N \to \infty$.

Proor. The proof is similar to Ruymgaart et al. (1972). Assuming that 
functions J, and JY, satisfy assumption A.2.2 with b, < a, + 1, it is enough to 
consider the term C,, only. 

Let r € (0,1) and e > 0 be fixed. For any positive integer m, define x,,,(3) = 
Yk — 1)/m for y,(k — 1)/m < 8 < Y„k/m, k = 1,..., m, where A,, = [0, y,,], 
i = 1,2. For arbitrary m we have |C,,| < L3_,C.1,,. where 


Chum = b; NIW, a(s) 7 Win(xim(8)) || o(s, t, d), d,)\|a(2 t Ly), 


Crem = fina NM xam(9))| 
x jols, t, d,, də) ~ o(Xim(s), X2m(t), d,, d,)|a(L + Ly), 


Cham = 





i N'?Win(xim(8)) O(Xim(8), Xam(t); dy, d) da(È me Ly) , 


and 9(s, t, dı, d3) = (1 — Fiy(s)Fi(Fin(s), d) AA Fyn (2), do). 
There exists a constant M, = M,(r) such that for N large enough sup|F,
F| < 1/2 and sup, || < M, Further, there exists a constant M, = M,(r, €) audi 
that for N large adi the sets Q, = {sup, N’/|W,y| < M3} and Q, = 
{A, C A} have probability at least 1 — e. 

Let us consider the term Ciim- The process NW, y converges weakly in 
@(A,,) to W,. Therefore, by employing a Skorohod construction, 


supa, NPW, = Win © Nami >p 0 as N, m > © 


and there exists a sequence nhmys my — 0 as N, m > œ, such that the set 
Q,, = (supa |Win — Win ° Xim! < nmn} bas probability at least 1 — e for all m 
and N sufficiently large. Combining, Qi N 2,9 Qm)Cium < Manmay > 0. 

Further, for d = 0,1, the functions Y/(u,d) and J(u, d) are uniformly 
continuous on [0,1 — 7/2] so that for N sufficiently large my = 
sup, |$(s, t, d,, d3) = (Xinl 5) Xen); d,, d,)| > 0 as m> oo. Hence KQ, N 
22)Ciim S Mamy > 0 as m, N > œ. 

Finally, for N sufficiently large and each w € Q, N Q., the integrand of C,,5,, 
is a step function which assumes value a,,,,,(w) for d = (d,, da), d, = 0,1 and 
(s, t) belonging to Rpm = (Y, (k — 1)/m, Yı,k/M] X (Yoel — 1)/m, Yz,l4/m], 
k,l = 1,..., m. Therefore 


KQ, N 22 )Ci13m = 


TMs 


kim 


m 1 1 
È È L arma |. d(È = Ly) 
[=1 d,=0 d,=0 R 


< 16m7M,(M, + £,,7)sup|L — Ly| 
and the bound converges in probability to 0 as N > œ. O 
LEMMA 5.3. B,, and C,, converge in probability to 0 as t > 0 and N > œ. 
Proor. Assuming that functions Y, and Y, satisfy A.2.2 with b, <a, +1, 
it is enough to consider the term C,, only. Let n, 0 < 4n < } — a, — ag, and 
e > 0 be fixed. As shown in Ruymgaart et al. (1972), applying dominated 


convergence theorem and Hélder’s inequality (3.4) with p, and p, as in (3.5), we 
can find 7 = 7(e) such that for all t < 7 


f yr A) dH < e. 
A; 


For 7 and N sufficiently large the set 2, = {A, C A} has probability at least 
1 — e. Further, by A.2.2 (with b, < a, + 1) and (3.1), [Cy4] < D3-:Ci4,, where 


Ca = fea (fia — Ain (Hn) dy |r( Bins) )* 'r( Hyy(t)) dHy, 


as 1/2 
Cie = i N 
AUAS 





[7 Hn) a(R, -K 





r( Hyy(8))"r(Hey(t)) dHy. 
By Lemma 4.2 in Ruymgaart et al. (1972), there exists M, = M,(e) such that the
set Q, = (sup N'AT — H,nir(Hin) "r(A — Hy)" < M,} has prob- 
ability at least 1 — e uniformly in N. Therefore 


I(2,)Cy4 S M, fr(Hn) in fea Ha) OO En) dHy. 


The first term of this bound does not depend on N and the underlying cdf’s. The 
second term is smaller than e with probability at least 1 — e for all r < 7 and N 
sufficiently large. This implies C,,, ~p0 as 7 > 0 and N > œ. 

The proof of asymptotic negligibility of C,,. is similar. 0 


LEMMA 5.4. For any c, and Cg, C} + Cy < } 


G2) Bin = Donal Vhs = Air( iy) r(Hon)* dH, 


(5.2) Ey = Í aA i "7(H>)r(Az - N`’) ak. 
x7(Hiy(s))"7( Hey (t))* dH 
converge in probability to 0 as t > 0 and N > oo. 


Proor. Let n, 0< 4n < }-— ci — cp and ¢>0 be fixed. Similarly as in 
Lemma 5.3 we have 


f An AOH da >p 0 
ANAS 


as tT — Oand N > œ. 

Let A, = {s: 1 — H, n(8) > £/N}. By Theorem 1.4 in van Zuijlen (1978), the 
set 2, = {A, Z A,} has probability at least 1 — e. We have I(R,)Ein < Ey, + Eig, 
where 


E, = f wel [lar = Hyy\r(Hyw)r( Az) ak, 
x7r( Hiy(s))°r(Hey(t))* dh, 
E= Les [rthw) d(K, = Kın) 


(the dependence of these terms on N is taken as understood). 

By Theorem 1.1 and Corollary 1.1 in van Zuijlen (1978), there exist constants 
M, = M,(e) and M, = M,(e) such that the sets Q, = {sup,r(Hy)r(H,y)~! < My} 
and Q, = {sup NPA — Hyylr(H,y)'?-"7( — Hin)?" < M,} have prob- 
ability at least 1 — e uniformly in N. Therefore 


(Q, N Q,) Ey, Ss M,M, {r( Hy)" df, fr Haw) Pr Haw) di 


r(Hyw(s))"r( Hoy (t))* dÊ 








and the bound converges in probability to 0 as 7 > 0 and N > oo. The proof of 
the asymptotic negligibility of E, is similar. Finally
(Q, O Q,)Esn s MN? f r(Hn) adhi f (Hn) Hn) aA 


< N-'A( Ne)? [r(Hyy)' "dA, 


xf Hi Ha adh 
AS 
and the bound converges in probability to 0 as 7 > 0 and N > œ. 0 
Lema 5.5. B and Cy. converge in probability to 0 as t > 0 and N > œ. 


Proor. Let e > 0 be fixed and let Q,, 2,, and &, be defined as in Lemma 5.4. 
Given e>0 there exists a constant M, = M,(e) such that the set 0, = 


{supa, r(Hy )r(H2y)~' < M,} has probability at least 1 — e uniformly in N. 
Therefore 


KA; N 2,)|Bys| < (m [r(En)® aft, + 1)E 


Here E,„ is given by (5.1) with c, = 0 and c, < } chosen arbitrarily. By Lemma 
5.4 the bound converges in probability to 0 as r > 0 and N > œ. 
The term C, can be written as 


Com f NUR ~ Fin) FP) Fa Faw) a. 


Applying inequalities jx, — x| < |Inx,—Inx,| for 0<x,,x,<1 and 0< 
—In(l - (1+ x)7')-(1+x)7! < (x(1 + x))7! for x > 0, it can be easily seen 
that for w € Q, and s € A, 


|Fi(s) — Fin(s)|<|Â:(8) — Ayy(s)|+ 2N7 fra Y)r( Hy - No ') dK. 


By (4.5), there exists a constant M, = M,(e) such that the set Q, = 

(sup, | F/(®))\7( Hw)” < Mj} has probability at least 1 — e uniformly in N. 

Therefore KO$ 2, NO 2,)|Cay| < MM (En + 2E.y) where Ey and Ezy are 
given by (5.1) and (5.2) with c, = b, and c, = a, as in assumption A.2.2. By 
Lemma 5.4, the bound converges in probability to 0 as r > 0 and N —> œ which. 
implies the asymptotic negligibility of Cy. O 


PRooF OF THEOREM 2.2. The remainder term Dy in (3.5) is a sum of 


Daf (F(A) -AEn ALF) ba 
Dow = fi FA) -AE n) Alfa) dh, 
Dow = f (FA) -A Fay) ) A Fin) dh 
Dux = f (FEB) -Al Faw) Ain) dhs 
Applying the mean value theorem and an argument similar to that in the proof of Theorem
2.1 it can be shown that D,y > p0 and Dy — p0 for any fixed r € (0,1) and 
N > oo, and D,y > p0 and Diy > p0 as 7 > Oand N > œ. 0O 
In the course of conducting a clinical trial to compare K (≥ 2) treatments,
it is often desirable to balance the trial with respect to the assignments
of patients to treatments. On the other hand, some form of randomization of
treatment assignments is essential for reducing experimental bias. In this
article, the large-sample approximation to the null distribution of K-sample
randomization tests generated from a broad class of restricted randomization
rules is derived. The implication of this result for conditional inference is also
discussed.

1. Introduction. In most comparative clinical trials patients become avail- 
able one at a time for treatment and must be assigned to a treatment group upon 
arrival. One of the most fundamental statistical issues in the evaluation of new 
treatments is how to allocate patients to treatment groups during the course of 
the trial. Avoidance of experimental bias may best be achieved by adopting some 
random mechanism for patient assignments. The most straightforward kind of 
randomization scheme, simple (or complete) randomization, assigns each patient 
with probability K~! to one of K possible treatments, assignments being made 
independently of one another. Simple randomization has the advantage that each 
treatment assignment is completely unpredictable, and it provides a basis for 
statistical inference. However, in small-to-moderate sized experiments, simple 
randomization may result in severe imbalance among the numbers of patients in 
the treatment groups. Pocock (1979, p. 188) recommends the simple randomiza- 
tion scheme only in large trials with over 200 patients. Even then, if one analyzes 
early results while the trial is in progress, the scheme may not be satisfactory. 
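
The imbalance risk described above can be illustrated with a small simulation. The following is only a sketch: the function names, the imbalance threshold, and the number of replications are illustrative choices, not quantities taken from the article.

```python
import random


def simple_randomization(n, k, rng):
    """Assign n patients independently, each to one of k arms with probability 1/k."""
    counts = [0] * k
    for _ in range(n):
        counts[rng.randrange(k)] += 1
    return counts


def imbalance_rate(n, k, threshold, reps, seed=0):
    """Fraction of simulated trials in which the largest and smallest arms
    differ by more than `threshold` patients (threshold is hypothetical)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        counts = simple_randomization(n, k, rng)
        if max(counts) - min(counts) > threshold:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    # Two arms, 20 patients: how often do the arms differ by more than 4?
    print(imbalance_rate(n=20, k=2, threshold=4, reps=2000))
```

Running the same function with a larger n and a threshold proportional to n shows the relative imbalance shrinking, which is consistent with the recommendation to reserve simple randomization for large trials.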

An alternative to complete randomization is the use of a restricted randomiza- 
tion rule to ensure comparability of treatment numbers during the course of the 
trial. In this article we focus on a class of adaptive treatment assignment rules. 
The term “adaptive” here indicates that the treatment assignment of the 
(n + 1)st patient may depend upon the assignments of the first n patients, but 
not upon the observed responses (cf. Simon, 1977). Now, suppose that at the end 
of the trial we are interested in testing the hypothesis $H_0$ that there is no
difference among K (≥ 2) treatment groups. Suppose further that it is inappropriate
to postulate that patients in the trial have been obtained by random
sampling from a certain population. Then, experimental randomization of treatments
to patients is the basis for inference (see Lehmann, 1975, Chapter 1). The
significance tests used for testing $H_0$ must be only those tests generated by the
experimental randomization design actually employed. For small trials, the 
randomization tests may be performed by computer simulation (cf. Simon, 1979, 
page 508). For large trials, however, this procedure becomes unwieldy. 

In the comparison of two treatments, a large-sample approximation to the null 
randomization distribution of test statistics for testing $H_0$ has been obtained by
Smythe and Wei (1983) for a particular class of adaptive designs. In this paper we 
study the asymptotic null distribution of test statistics for K-treatment compari- 
sons under the randomization model, when treatment assignments are made by 
restricted randomization schemes described, with examples, in Section 2. The 
main limit theorem, which provides the asymptotic null randomization distribu- 
tion of the K-sample test statistics, is presented in Section 3. In Section 4 we 
pursue a suggestion of Cox (1982) concerning conditional randomization tests and 
present a conjecture regarding convergence of conditional distributions. For 
continuity of presentation, the proof of the main theorem is deferred to the 
Appendix. 


2. Restricted randomization rules. Efron (1971) proposed the “biased 
coin” design in the two-treatment case as a compromise between a perfectly 
balanced assignment scheme and a completely randomized one. This was gener- 
alized by Wei (1978a) to an “adaptive biased coin” design. Adaptive procedures 
for K > 2 treatments have been proposed by Wei (1978b), Efron (1980), Atkinson 
(1982), and Smith (1984a). Asymptotically, Wei’s procedure balances the experi- 
ment completely, i.e., in the limit, each of the K treatments receives a portion 
1/K of patients. Atkinson’s procedure is based on the concepts of D and D, 
optimality and allows different limiting proportions of patients assigned to 
treatment groups. Smith’s procedure, which we will consider in this article, 
generalized Wei’s scheme to achieve prespecified limiting proportions $\xi_1, \ldots, \xi_K$,
where $\xi_j > 0$, $1 \le j \le K$, and $\sum_{j=1}^{K} \xi_j = 1$.

Smith’s procedure works in the following way. Suppose that after i assignments
($i \ge 1$) there are $N_{ji}$ patients in treatment group j, where $1 \le j \le K$ and
$\sum_{j=1}^{K} N_{ji} = i$. Let $p = (p_1, p_2, \ldots, p_{K-1})'$ be a $(K-1) \times 1$ vector whose jth
component is the probability that treatment j will be assigned to the (i + 1)st
patient. We regard p as a function from $\Omega$ to $\Omega$, where $\Omega = \{y = (y_1, \ldots, y_{K-1}):
y_j \ge 0, \sum_{j=1}^{K-1} y_j \le 1\}$. This function depends on the assignments of the first i
patients through the vector $i^{-1}N_i$, where $N_i = (N_{1i}, \ldots, N_{K-1,i})$. Given y, let
$p_K(y) = 1 - \sum_{j=1}^{K-1} p_j(y)$ and $y_K = 1 - \sum_{j=1}^{K-1} y_j$. Then $p_1, \ldots, p_K$ are assumed to
satisfy:

(A.1) If $y_j \ge \xi_j$, then $p_j(y) \le \xi_j$, $1 \le j \le K$.


(This formulation differs from the one in Smith (1984a), but is essentially 
equivalent to it.) Here are some examples of treatment assignment rules satisfy- 
ing (A.1): 


EXAMPLE 2.1. The adaptive biased coin design (Wei, 1978a): Let K = 2 and
let $p_1$ be a continuous nonincreasing function from [0, 1] into [0, 1]. Let $\xi_1$ be the
unique value satisfying $p_1(\xi_1) = \xi_1$; then $p = p_1$ satisfies (A.1). If $p_1$ is not
assumed continuous, but $p_1(x)$ is $> 1/2$, $= 1/2$, or $< 1/2$ according as $x < 1/2$, $x = 1/2$,
or $x > 1/2$, then $p_1$ satisfies (A.1) with $\xi_1 = \xi_2 = 1/2$. Efron’s (1971) biased coin
model is of this type: $p_1(x) = A$ for $x < 1/2$, $p_1(1/2) = 1/2$, and $p_1(x) = 1 - A$ for
$x > 1/2$, where $1/2 < A < 1$.
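
As a rough illustration of Example 2.1, the two-arm rules can be written as functions of the current proportion x of patients on treatment 1. This is a sketch under stated assumptions: the bias value 2/3, the convention x = 1/2 before any patient has been assigned, and the function names are illustrative and not taken from the article.

```python
import random


def efron_coin(x, bias=2 / 3):
    """Efron-type biased coin: probability of assigning treatment 1 when a
    proportion x of earlier patients received treatment 1.
    `bias` plays the role of the constant in (1/2, 1); 2/3 is illustrative."""
    if x < 0.5:
        return bias
    if x > 0.5:
        return 1 - bias
    return 0.5


def simulate_two_arm(p1, n, seed=0):
    """Assign n patients sequentially with rule p1; return the count on arm 1.
    Assumption: before the first assignment the proportion is taken as 1/2."""
    rng = random.Random(seed)
    n1 = 0
    for i in range(n):
        x = n1 / i if i > 0 else 0.5
        if rng.random() < p1(x):
            n1 += 1
    return n1


if __name__ == "__main__":
    print(simulate_two_arm(efron_coin, 200))        # discontinuous rule
    print(simulate_two_arm(lambda x: 1 - x, 200))   # a continuous nonincreasing rule
```

The second call uses p_1(x) = 1 - x, one continuous nonincreasing choice with fixed point 1/2; any other rule meeting the conditions of the example could be passed in the same way.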

EXAMPLE 2.2. Simple randomization: Take $\xi_j = K^{-1}$ and $p_j(y) = K^{-1}$,
$1 \le j \le K - 1$.


EXAMPLE 2.3. Generalized urn design (Wei, 1978b): Let $\xi_j = K^{-1}$ and
$p_j(y) = (1 - y_j)/(K - 1)$, $1 \le j \le K - 1$.


Asymmetric designs can easily be constructed by modifying this example: If, 
say, K=3, £,=4, & = 4%, and & = $, let py) =(1—y,)/2 and pay) = 
47 ie 
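
The symmetric rule of Example 2.3, read here as p_j(y) = (1 - y_j)/(K - 1), can be simulated directly. The sketch below is illustrative only: the function names and the starting convention (equal proportions before the first assignment) are assumptions, not part of the article.

```python
import random


def urn_probs(y):
    """Assignment probabilities for all K arms under the symmetric urn rule,
    p_j(y) = (1 - y_j) / (K - 1), given current proportions y (length K)."""
    k = len(y)
    return [(1 - yj) / (k - 1) for yj in y]


def simulate_design(probs, k, n, seed=0):
    """Sequentially assign n patients to k arms using the rule `probs`.
    Assumption: proportions are taken as 1/k before the first assignment."""
    rng = random.Random(seed)
    counts = [0] * k
    for i in range(n):
        y = [c / i for c in counts] if i > 0 else [1 / k] * k
        arm = rng.choices(range(k), weights=probs(y))[0]
        counts[arm] += 1
    return counts


if __name__ == "__main__":
    counts = simulate_design(urn_probs, k=3, n=3000)
    print([c / 3000 for c in counts])  # proportions should be close to 1/3 each
```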

EXAMPLE 2.4. Atkinson’s (1982) design: If we again take $\xi_j = K^{-1}$, $1 \le j \le K$,
this design gives

$p_j(y) = (y_j^{-1} - 1) \Big/ \sum_{i=1}^{K} (y_i^{-1} - 1)$, $1 \le j \le K - 1$.

Smith (1984a) showed that (A.1) holds in this case.
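
A minimal sketch of the rule in Example 2.4, read as p_j(y) = (1/y_j - 1) / sum_i (1/y_i - 1), together with a pointwise check of condition (A.1), read as "y_j >= xi_j implies p_j(y) <= xi_j". The helper names are hypothetical, and the rule is only defined here when every arm already has at least one patient.

```python
def atkinson_probs(y):
    """Assignment probabilities for all K arms, assuming every y_j > 0."""
    odds = [1 / yj - 1 for yj in y]
    total = sum(odds)
    return [o / total for o in odds]


def satisfies_a1(probs, y, xi, tol=1e-12):
    """Check (A.1) at a single point y: whenever y_j >= xi_j, p_j(y) <= xi_j."""
    p = probs(y)
    return all(pj <= xj + tol for pj, yj, xj in zip(p, y, xi) if yj >= xj)


if __name__ == "__main__":
    y = [0.5, 0.3, 0.2]
    print(atkinson_probs(y))
    print(satisfies_a1(atkinson_probs, y, [1 / 3] * 3))
```

For K = 2 the rule reduces to p_1 = (1 - y_1)^2 / (y_1^2 + (1 - y_1)^2), which gives a quick sanity check on the formula.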


The following proposition is crucial for the proof of the main theorem in the
next section. It shows that the desired proportions are achieved in the limit, for a
continuous p satisfying (A.1).


PROPOSITION 2.5. Let p satisfy (A.1). If each $p_j$ is continuous, then
$p_j(i^{-1}N_i) \to_P \xi_j$ as $i \to \infty$, for $1 \le j \le K$.


PROOF. By (A.1), $p_j(\xi) \le \xi_j$, $1 \le j \le K$. If any of these inequalities were
strict, we would have, summing both sides from j = 1 to j = K, that 1 < 1;
hence $p_j(\xi) = \xi_j$, $1 \le j \le K$. Lemma 4 of Smith (1984a) gives that $i^{-1}N_{ji} \to_P \xi_j$
for $1 \le j \le K - 1$; the result now follows from Slutsky’s theorem. □


3. Randomization tests based on restricted randomization. Suppose
patients have been assigned to treatments by a restricted randomization rule and
at the end of the trial $\{x_1, \ldots, x_n\}$ is a sequence of observed responses. Let the
corresponding scores of the x’s be denoted by $a_{1n}, \ldots, a_{nn}$, where $a_{in}$ may be,
e.g., a function of the rank of $x_i$ among all x’s. Furthermore, let $T_{ij}$ be 1 if the
ith patient is assigned to treatment j and 0 otherwise (j = 1, 2, ..., K). We are
interested in testing the hypothesis $H_0$ that there is no difference among the K
treatments. In this section the large-sample approximation to the null distribution
of K-sample randomization test statistics will be derived under a randomization
model.

If simple randomization of patients to treatments is employed (so that
$\xi_j = K^{-1}$ for each j) it is known that the following condition on the scores $\{a_{in}\}$
implies asymptotic joint normality (after standardization) of the vector
$(\sum_{i=1}^{n} a_{in}T_{i1}, \ldots, \sum_{i=1}^{n} a_{in}T_{i,K-1})$:

(3.1) $\lim_{n \to \infty} \max_{1 \le i \le n} a_{in}^2 \Big/ \sum_{i=1}^{n} a_{in}^2 = 0$

(cf. Theorem 5 of the Appendix of Lehmann (1975)).
We will consider the class of designs satisfying (A.1) and the following 
additional condition: 


(A.2) p is twice continuously differentiable with bounded second derivatives.


The next theorem will show that condition (3.1) is also sufficient for the
asymptotic normality of the standardized $\sum_i a_{in}T_{ij}$ resulting from designs satisfying
both (A.1) and (A.2). First, a lemma and some notation are needed:


LEMMA 3.2. Let $d_{jk} = \partial p_j(\xi)/\partial y_k$. Then $d_{jk} = \delta_{jk}\gamma$, $1 \le k, j \le K - 1$, where
$\delta_{jk}$ is the Kronecker delta and $\gamma$ is a constant.


PROOF. The argument is similar to that made in Section 3 of Smith (1984a);
our $\gamma$ and Smith’s $\rho$ are related by $\gamma = -\rho$. Recall (Proposition 2.5) that
$p_j(\xi) = \xi_j$. For $\delta$ sufficiently small, either positive or negative, and $k \ne j$,

(3.3) $p_k(\xi_1, \ldots, \xi_{j-1}, \xi_j + \delta, \xi_{j+1}, \ldots, \xi_{K-1}) \le \xi_k = p_k(\xi)$.

Thus $\partial p_k(\xi)/\partial y_j = d_{kj} = 0$ if $k \ne j$. Next, observe that for sufficiently small $\delta$,

$\xi_K \ge p_K(\xi_1 + \delta, \xi_2 - \delta, \xi_3, \ldots, \xi_{K-1}) = \xi_K + \delta d_{K1} - \delta d_{K2} + o(\delta)$.

This implies that $d_{K1} = d_{K2}$ and similarly that all $d_{Kj}$, $1 \le j \le K - 1$, have the
same value. But since $\sum_{j=1}^{K} p_j(y) = 1$, it follows that $\sum_{j=1}^{K} d_{jk} = 0$, $1 \le k \le K - 1$,
and hence that $d_{jj} = -d_{Kj} = \gamma$ for $1 \le j \le K - 1$. (From (3.3) it is evident that
$\gamma \le 0$.) □
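
Lemma 3.2 says that the Jacobian of p at the target proportions is a constant multiple of the identity. A finite-difference check makes this concrete for two of the example designs. The sketch below assumes the readings p_j(y) = (1 - y_j)/(K - 1) for the urn rule and p_j(y) = (1/y_j - 1)/sum_i(1/y_i - 1) for Atkinson's rule; function names and the step size h are illustrative.

```python
def urn_reduced(y):
    """Urn rule on the first K-1 coordinates; y holds y_1, ..., y_{K-1}."""
    k = len(y) + 1
    return [(1 - yj) / (k - 1) for yj in y]


def atkinson_reduced(y):
    """Atkinson-type rule on the first K-1 coordinates, with y_K = 1 - sum(y)."""
    full = list(y) + [1 - sum(y)]
    odds = [1 / v - 1 for v in full]
    total = sum(odds)
    return [o / total for o in odds[:-1]]


def jacobian_at(p, xi, h=1e-6):
    """Central-difference estimate of d p_row / d y_col at the point xi."""
    m = len(xi)
    jac = [[0.0] * m for _ in range(m)]
    for col in range(m):
        up, dn = list(xi), list(xi)
        up[col] += h
        dn[col] -= h
        pu, pd = p(up), p(dn)
        for row in range(m):
            jac[row][col] = (pu[row] - pd[row]) / (2 * h)
    return jac


if __name__ == "__main__":
    xi = [1 / 3, 1 / 3]  # K = 3, equal target proportions
    print(jacobian_at(urn_reduced, xi))       # approximately gamma * I, gamma = -1/(K-1)
    print(jacobian_at(atkinson_reduced, xi))  # approximately gamma * I, gamma = -K/(K-1)
```

In both cases the off-diagonal entries vanish and the common diagonal value is negative, in line with the lemma and the remark that the constant is nonpositive.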


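Now define the sequence of modified scores $\{b_{in}\}$ as follows:

(3.4) $b_{in} = a_{in} + \gamma \sum_{l=i+1}^{n} \frac{a_{ln}}{l-1} \prod_{j=i}^{l-2} (1 + \gamma j^{-1})$, $i = 1, 2, \ldots, n$,

where by convention $\prod_{j=i}^{l-2} = 1$ if $l - 2 < i$, and let $s_n^2 = \sum_{i=1}^{n} b_{in}^2$. Define

$W_{jn} = \sum_{i=1}^{n} \frac{a_{in}}{s_n} (T_{ij} - \xi_j)$, $1 \le j \le K - 1$.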

THEOREM 3.5. Suppose that the sequence $(T_{ij})_{j=1}^{K-1}$, $i = 1, \ldots, n$, is generated
by a design satisfying conditions (A.1) and (A.2). If (3.1) holds, then the random
vector $W_n = (W_{jn})_{j=1}^{K-1}$ converges in distribution to a multivariate normal
distribution with mean 0 and covariance matrix $\Sigma = (\sigma_{jk})$, where $\sigma_{jj} = \xi_j(1 - \xi_j)$ and
$\sigma_{jk} = -\xi_j\xi_k$, $k \ne j$.

A natural statistic for testing $H_0$ is then $T_n = W_n'\Sigma^{-1}W_n$, which has an
asymptotic $\chi^2_{K-1}$ distribution. (Note that $\Sigma$ is simply the covariance matrix of a
multinomial random vector with cell probabilities $\xi_1, \ldots, \xi_{K-1}$.)
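
Because the covariance matrix has multinomial form, the quadratic form W' Sigma^{-1} W can be evaluated without a matrix inverse: it equals the sum over all K arms of W_j^2 / xi_j, where W_K = -(W_1 + ... + W_{K-1}). The sketch below uses that identity. It assumes W_jn = sum_i (a_in / s_n)(T_ij - xi_j), arms coded 0, ..., K-1, and a value of s_n supplied by the caller; the function name is hypothetical.

```python
def randomization_statistic(a, assignments, xi, s_n):
    """Quadratic form W' Sigma^{-1} W computed as sum_j W_j^2 / xi_j over all K arms.

    a           -- scores a_1n, ..., a_nn
    assignments -- arm index (0, ..., K-1) for each patient
    xi          -- target proportions for all K arms (sums to 1)
    s_n         -- square root of the sum of squared modified scores
    """
    k = len(xi)
    w = [0.0] * k
    for a_i, arm in zip(a, assignments):
        for j in range(k):
            indicator = 1.0 if arm == j else 0.0
            w[j] += a_i * (indicator - xi[j]) / s_n
    return sum(wj * wj / xj for wj, xj in zip(w, xi))


if __name__ == "__main__":
    # Two arms, constant scores, simple randomization (so s_n^2 = sum of a_i^2 = 4).
    print(randomization_statistic([1, 1, 1, 1], [0, 0, 1, 1], [0.5, 0.5], 2.0))
```

The value would then be referred to a chi-square distribution with K - 1 degrees of freedom, as stated above.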

We conclude this section with several remarks concerning Theorem 3.5. 


REMARK 1. Examples 2.2-2.4 of Section 2 satisfy condition (A.2). For Efron’s 
biased coin model, where p is not even continuous (cf. Example 2.1), asymptotic 
normality may fail to hold (cf. Smythe and Wei, 1983). However, restricted 
randomization designs with the balancing action expressed by conditions differ- 
ent from (A.1) are certainly possible. One seemingly natural alternative is given 
by 
(A.1’) $y_j \ge \xi_j$ if and only if $p_j(y) \le \xi_j$, $1 \le j \le K - 1$.


Here is an example satisfying (A.1’) but not (A.1): 


EXAMPLE 3.6. Let K = 3, £, = 1, = 4, & = $, and let p(y) = i- y), 
pAy) = 41 — yZ). Theorem 3.5 holds for this design. On the other hand, as noted 
by Smith (1984a), Atkinson’s procedure (Example 2.4) satisfies (A.1) but not 
(A.1’). For designs satisfying (A.1’) but not (A.1), Lemma 3.2 need not hold, so 
that Theorem 3.5 is not always applicable. 


REMARK 2. A real question of interest in inference for these designs concerns 
the potential loss of accuracy incurred if a significance test is performed assuming 
simple randomization, when in fact a restricted randomization design has been 
used. For the designs considered in Theorem 3.5, the accuracy of such an 
approximation clearly depends on the ratio $\sum_{i=1}^{n} a_{in}^2 / \sum_{i=1}^{n} b_{in}^2$, which will depend on
the value of $\gamma$ (defined in Lemma 3.2). A small simulation study performed for an
urn design by Smythe and Wei (1983) when K = 2 suggests that if the variance 
due to simple randomization is used in the analysis, the true significance level is 
considerably smaller than the nominal level of the test. 


REMARK 3. As pointed out in Smythe and Wei (1983), condition (3.1) on the
$\{a_{in}\}$ is general enough to permit applications of practical interest. For example,
let $\varphi$ be a real-valued function defined on (0, 1) with $\int_0^1 \varphi^2(x)\,dx < \infty$, and suppose
that either: (a) $\varphi$ is monotonic or (b) $\varphi$ is continuous and monotonic on $(0, \varepsilon)$ and
$(1 - \varepsilon, 1)$ for some $\varepsilon > 0$. If $a_{in} = \varphi(i/(n + 1))$, it is easy to show that (3.1) holds.
In particular, if F is a strictly increasing distribution function with a finite
second moment, then $\int_0^1 F^{-1}(x)^2\,dx < \infty$ and $a_{in} = F^{-1}(i/(n + 1))$ satisfies (3.1).
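
Condition (3.1) is easy to inspect numerically for a given score function. As an illustration only, the sketch below takes F to be the standard normal distribution, so that a_in = F^{-1}(i/(n + 1)) are normal scores, and evaluates the ratio max_i a_in^2 / sum_i a_in^2 for increasing n. The function names are hypothetical.

```python
from statistics import NormalDist


def max_score_ratio(scores):
    """Ratio max_i a_i^2 / sum_i a_i^2 appearing in condition (3.1)."""
    squares = [s * s for s in scores]
    return max(squares) / sum(squares)


def normal_scores(n):
    """Scores a_in = F^{-1}(i / (n + 1)) with F the standard normal cdf."""
    inv = NormalDist().inv_cdf
    return [inv(i / (n + 1)) for i in range(1, n + 1)]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max_score_ratio(normal_scores(n)))
```

The ratio decreases as n grows, which is the behavior (3.1) requires in the limit.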


REMARK 4. Smith (1984b) has shown that the range of values of $s_n^2 = \sum_{i=1}^{n} b_{in}^2$
provides an indication of the vulnerability of the design to experimental bias. 
Theorem 3.5 makes possible a more rigorous statement of some results in Sections 
5, 6, and 9 of that paper. 
4. Conditional inference. Cox (1982) introduced the idea of a conditional
randomization test, whereby the significance level is computed conditionally on
$N_{1n} - N_{2n}$ (in the case K = 2) or some other indicator of the balance of the
design. The question therefore arises whether the results of Section 3 may be
used to construct conditional tests.

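For the sake of simplicity, we consider only the case K = 2, $\xi_1 = \xi_2 = 1/2$, and
$D_n = N_{1n} - N_{2n}$, though the same discussion applies to K > 2 and to other
conditioning variables. Assume the $\{a_{in}\}$ are rescaled so that $\sum_{i=1}^{n} b_{in}^2 = 1$ and
define an array $\{\tilde b_{in}\}$ from (3.4) with $n^{-1/2}$ in place of $a_{in}$. For any real $\alpha$ and $\beta$,
the linear combination $\alpha W_n + \beta n^{-1/2}D_n$ converges to a normal distribution, by
Theorem 3.5. Specifically,

$\alpha W_n + \beta n^{-1/2}D_n = \sum_{i=1}^{n} (\alpha a_{in} + 2\beta n^{-1/2})(T_{i1} - 1/2)$.

If $\hat b_{in} = \alpha b_{in} + 2\beta\tilde b_{in}$ and $\hat s_n^2 = \sum_{i=1}^{n} \hat b_{in}^2$, then Theorem 3.5 shows that
$\alpha W_n + \beta n^{-1/2}D_n$ has approximately, for large n, a $N(0, \frac14 \hat s_n^2)$ law, with

$\hat s_n^2 = (\alpha, \beta) \begin{pmatrix} 1 & 2\sum b_{in}\tilde b_{in} \\ 2\sum b_{in}\tilde b_{in} & 4\sum \tilde b_{in}^2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$.

The Cramér-Wold device (cf. Billingsley, 1968, page 48) then implies that
$(W_n, n^{-1/2}D_n)$ is asymptotically approximated by a bivariate normal with mean 0
and covariance matrix

$\begin{pmatrix} \frac14 & \frac12\sum b_{in}\tilde b_{in} \\ \frac12\sum b_{in}\tilde b_{in} & \sum \tilde b_{in}^2 \end{pmatrix}$.

This suggests the

CONJECTURE 4.1. The conditional distribution of $W_n$ given $D_n$ can be
approximated asymptotically by a normal distribution with mean
$n^{-1/2}D_n \sum b_{in}\tilde b_{in} / (2\sum \tilde b_{in}^2)$ and variance $\frac14\{1 - (\sum b_{in}\tilde b_{in})^2 / \sum \tilde b_{in}^2\}$.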

The conjecture is not a corollary of Theorem 3.5, because asymptotic condi- 
tional normality does not follow automatically from asymptotic joint normality. 
Steck (1957) and Holst (1981), among others, have considered such questions for 
sums of independent random vectors, but there is no obvious way to extend their 
technique here. If D, were treated as a categorical variable with J categories, the 
conditional inference implied by the conjecture could be rigorously justified. 


APPENDIX 


PROOF OF THEOREM 3.5. Let

(B.1) $Y_n = \left( \sum_{i=1}^{n} (b_{in}/s_n)\,(T_{ij} - p_j(N_{i-1}/(i-1))) \right)_{j=1}^{K-1}$,

where $b_{in}$ is defined in (3.4). We first show that under the condition

(B.2) $\max_{1 \le i \le n} b_{in}^2 \Big/ \sum_{i=1}^{n} b_{in}^2 \to 0$ as $n \to \infty$,

$Y_n$ converges in distribution to $N_{K-1}(0, \Sigma)$.
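The proof employs the Cramér-Wold device. Given a set of constants
$\alpha_1, \ldots, \alpha_{K-1}$, consider the random variable

(B.3) $Q_n = \sum_{j=1}^{K-1} \alpha_j A^{-1/2} Y_{jn}$,

where $Y_{jn}$ is the jth component of $Y_n$ and $A = \sum_{j=1}^{K-1} \alpha_j^2\xi_j - (\sum_{j=1}^{K-1} \alpha_j\xi_j)^2$. The
variable $Q_n$ can be rewritten as

$\sum_{i=1}^{n} Z_{ni}$, where $Z_{ni} = \frac{b_{in}}{A^{1/2}s_n} \sum_{j=1}^{K-1} \alpha_j \left[ T_{ij} - p_j\!\left( \frac{N_{i-1}}{i-1} \right) \right]$.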

Since E[T,,IN,1]=p,N,-./(i— 1), it follows that the array {Z,,} is a 
martingale difference array with respect to the o-fields F, generated, for each n, 
by {Te j=1,....K-1,k = 1,...,i} (the o-fields do not depend on n). We 
appeal to Theorem 2.3 of McLeish (1974) to prove asymptotic normality of Q,. 
Conditions (a) and (b) of this theorem are trivially satisfied by (B.2) and the fact 
that AW? OE a (T, — p(N,,/(i — 1))) is bounded. To check condition (c), 


ion! 
which is that È? Za >,1 as n> oo, we define UZ=U'_,22,, Vi= 


ral“z 


|. £(Z2,|F,_,). It follows from (B.2) that the conditional Lindeberg condition 


gm 


holds: For e>Q0, 


n 
È E(ZRI( Zul > EF, -1) > Os 
t=} 
where I(-) is the indicator function. Also, it is easily seen that sup, P(V,2, > A) > 0 
as A — œ, 80 it follows from Theorem 2.23 of Hall and Heyde (1980, page 44) 
that max,|U,2 — VŽ] >, 0 as n > oo. So if we can prove that VŽ, >, 1, it will 
follow that U2, = I7.,22, >, 1. 


But E(Z2\F,_,) = (b2,/As2)D,, where 


D, = e| E alr, - pi(N,-/(i - fis] 


dml 


= E aff piN,./Aé~ D) = rH aG- D) 
m E E aanpi(N (i a 1))Pa(N.-1/(i ~ 1)), 


mel 


and using Proposition 2.5, D, converges in probability to A as i > oo. Thus 
Vz, > pl, and we conclude from McLeish’s theorem that Ef Zn, > a N(0,1). 
Since $\alpha_1, \ldots, \alpha_{K-1}$ are arbitrary constants, it follows that $Y_n$ converges in
distribution to $N_{K-1}(0, \Sigma)$.

Next, using condition (A2), expand p/(y,,..-, Yg-ı) about the point 
(n-o Sc): 


K-1 
PAY Yk) = piln- §x-1) + 2 d,,( y, a £) 


(B.4) K-1K-1 9? Ta - Ng-1) 
tL a ee 


j=l m=1 dy, I Ym 
where (7, ..., ng—ı) lies on the line segment joining y to & We will show that 


(Bs) E va{' 5 (Fee -oF e) -a as nto. 


1m1 j=l l= 


(y, = Em — Em) 





For this it suffices to prove that 


(B.6) x Pal( Tact g) > 0 as n > œ. 

Ti su isl : P 

But from Lemma 4 of Smith (1984a), E(N, ,_,/(i — 1) — & Y < K?{i — 171, s0 
that the expectation of (B.6) is bounded by 


K? z |bal K? Bale Pyn ; ~4/3 re 
Sn 2i S Sn (i ia | [Ec-» | , 


using the Cauchy-Schwarz inequality. In view of (B.2), it follows easily that this
converges to 0 as $n \to \infty$. Thus (B.5) is established. By Lemma 3.2, $d_{lj} = \gamma \delta_{lj}$,
where $\delta_{lj}$ is the Kronecker delta, $1 \le j, l \le K - 1$. By (B.4), (B.5), and condition
(A2), $Y_n$ is asymptotically equivalent to


| es 6 -re- "Es : 











n r=1 jmi 


(with the convention that 0/0 = 0), which can be rewritten using (3.4) as 


rad a 


To complete the proof of Theorem 3.5 we will show that (3.1) implies (B.2). To
do this we first show that $\max_i |b_{in}| \le C_1 \max_i |a_{in}|$, and then that
$\sum_i a_{in}^2 \le C_2 \sum_i b_{in}^2$.

From (3.4) we have 


il-2 
(B7) [Bal < laal + |Y( max Jan) Z je- o Tia +v 


i=i+1 jut 
< max ja,,|{1 + lyI@,,}, 
where $G_{in}$ denotes the sum that multiplies $|\gamma| \max_i |a_{in}|$ in the first line of (B.7).
It is easily seen that for $|\gamma| < i$,
$\prod_{j=i}^{l-2} (1 + \gamma j^{-1}) \le C((l-2)/i)^{\gamma}$ for $l - 2 \ge i$, with $C$ constant. From this it
follows that $G_{in} \le M < \infty$ for all $i$ and $n$, proving that $|b_{in}| \le C \max_i |a_{in}|$. For
the second part, observe that
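(B.8)  $a_{in} = b_{in} - \gamma \sum_{l=i+1}^{n} b_{ln}/(l-1) = b_{in} - \gamma c_{in}$, say.

Hence

$\sum_i a_{in}^2 = \sum_i (b_{in} - \gamma c_{in})^2 \le \left[ s_n + |\gamma| \left( \sum_i c_{in}^2 \right)^{1/2} \right]^2$,

by Schwarz's inequality.
Now

$\sum_i c_{in}^2 = \sum_i \left\{ \sum_{l>i} b_{ln}/(l-1) \right\}^2$
$\phantom{\sum_i c_{in}^2} = \sum_i \left\{ \sum_{l>i} b_{ln}^2/(l-1)^2 + 2 \sum_{m>l>i} b_{ln} b_{mn}/((l-1)(m-1)) \right\}$
$\phantom{\sum_i c_{in}^2} = \sum_l b_{ln}^2/(l-1) + 2 \sum_l b_{ln} \sum_{m>l} b_{mn}/(m-1)$
$\phantom{\sum_i c_{in}^2} \le s_n^2 + 2 \sum_l |b_{ln} c_{ln}| \le s_n^2 + 2 s_n \left( \sum_l c_{ln}^2 \right)^{1/2}$,

where we have used Schwarz's inequality in the last inequality. If $c^2 = \sum_i c_{in}^2$,
we get $s_n^2 + 2 s_n c - c^2 \ge 0$, whence we must have $s_n \ge (\sqrt{2} - 1)c$. It now follows
from (B.8) that $\sum_i a_{in}^2 \le (1 + |\gamma|(1 + \sqrt{2}))^2 s_n^2$, completing the proof of Theorem 3.5.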


Acknowledgments. We are grateful to the referee for helpful comments and
to the editor for references related to the conditional distributions problem.


FINITE STOPPING IN SEQUENTIAL SAMPLING WITHOUT
RECALL FROM A DIRICHLET PROCESS


By RONALD CHRISTENSEN 


Montana State University 


This paper shows that for sequential sampling without recall from a
Dirichlet process, there exists a finite bound beyond which one will cease
sampling with probability 1. The result is valid for Dirichlet processes defined
on closed bounded intervals of the real line.


1. Introduction. This paper shows that for sequential sampling without 
recall from a Dirichlet process on a bounded interval of the real line, there exists 
a finite constant that is an almost sure upper bound to the sample size of an 
optimal procedure. As a referee has stated, “[This] shows that not even God can 
fool a Bayesian forever.” This section introduces the problem and notation. 
Section 2 contains the result. Christensen (1983) contains additional results on 
sampling without recall from Dirichlet processes. 

A shopper must buy an item. He can elicit price quotations sequentially, but 
must pay for each quotation. A price obtained is valid only at the time of the 
quotation. The shopper can buy the item at the current price but is not allowed 
to return to previous prices to buy the item. The shopper seeks to minimize his 
total cost, i.e., the price paid plus the cost of the price quotations obtained. 

Suppose a sequential random sample $X_1, X_2, \ldots$ is available from a possibly
random distribution, say $F$. Without loss of generality, the first variable $X_1$ is
observed at no cost. The observer (shopper) can choose to stop sampling and
accept the observation $X_1$, or pay a cost, say $C$, and take another observation.
For any $j \ge 1$, after $X_1, X_2, \ldots, X_j$ have been obtained the observer can either
accept $X_j$ and make no further observations, or pay $C$ and observe $X_{j+1}$. If $X_j$ is
accepted, the observer's total cost is $X_j$ plus the sampling costs up to that point,
$(j - 1)C$. The observer's goal is to find a way to minimize his total payments.
Clearly, the observer's problem is to find the best method of determining when to
stop taking observations. If $N$ is any stopping rule, the observer seeks a stopping
rule $N'$ so that

$E(X_{N'} + (N' - 1)C) = \min_N E(X_N + (N - 1)C)$.


This problem is known as sequential sampling without recall. 
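
The cost accounting above can be illustrated with a short sketch. It evaluates the total payment $X_j + (j-1)C$ for one simple stopping rule applied to a fixed list of quotations. The threshold rule and the names `total_cost` and `threshold` are assumptions made for illustration only; they are not the optimal procedure studied in this paper.

```python
def total_cost(prices, threshold, cost):
    """Total payment X_j + (j - 1) * C under a hypothetical threshold rule.

    The rule (an assumption made for illustration) accepts the first price
    that is at most `threshold`. If no price qualifies, the last quoted
    price is accepted, because earlier prices cannot be recalled.
    """
    if not prices:
        raise ValueError("at least one price quotation is required")
    for j, price in enumerate(prices, start=1):
        if price <= threshold or j == len(prices):
            # The first quotation is free, so j quotations cost (j - 1) * C.
            return price + (j - 1) * cost
```

For instance, with quotations 5, 4, 2, 1, a threshold of 3 and C = 0.5, the rule stops at the third quotation and pays 2 + 2(0.5).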

The distribution F has not yet been discussed. Most of the literature on this 
problem deals with the special case where X,, X,,... are independent and 
identically distributed (i.i.d.) from the fixed distribution $F$. DeGroot (1968, 1970)
considered the problem where $F$ is a normal distribution with unknown mean $W$
and variance 1. Prior beliefs about $F$ are incorporated by putting a normal
distribution with known mean $\mu$ and known variance $\sigma^2$ on the parameter $W$.
Rothschild (1974) restricted F to the class of multinomial distributions with a 
fixed number of categories, say n. The probabilities for each category were 
unknown but prior beliefs were modeled with an n-dimensional Dirichlet distri- 
bution. Rothschild suggested generalizing his work by taking the random prob- 
ability measure F to have a Dirichlet process. 

In related work, Ferguson (1974) briefly considers the problem of sampling 
with recall from a Dirichlet process. Clayton (1985) has found a similar bound on 
sampling in a sequential testing problem using Dirichlet processes. 

The notation follows Ferguson (1973) and Rothschild (1974). For a Dirichlet
process defined on a subset of the real line, say $\mathcal{X}$, with parametric measure $\alpha(\cdot)$,
the notation $F \sim \mathcal{D}(\alpha)$ is used to indicate that $F$ is a random observation from
the Dirichlet process. The parameter $\alpha$ is assumed to be a finite measure. The
weight of the measure $\alpha$ is denoted $W = \alpha(\mathcal{X})$. When $F \sim \mathcal{D}(\alpha)$, $F$ is used to
denote both the Dirichlet process and a realization of the process.

The marginal distribution of an observation, $p$, from $F$ is $\alpha(\cdot)/W$. This
marginal distribution is also denoted as $E(F)$ and $EF$, because the marginal
distribution of $p$ can be used to define the expected value of $F$ when considering
$F$ as a distribution function.

When $p$ has been observed from a Dirichlet process $F$, the posterior distribution
of $F$ given $p$ is denoted $F|p$, and $F|p \sim \mathcal{D}(\alpha(\cdot) + \delta_p(\cdot))$, where $\delta_p(\cdot)$ is a
measure that gives point mass 1 to $p$. If $p'$ is an observation from $F|p$, the
marginal distribution of $p'$ can be written as


$E(F|p) = [\alpha(\cdot) + \delta_p(\cdot)]/[W + 1]$.
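
The updating rule above extends to several observations by adding one point mass per observation. The following sketch computes the resulting predictive probability of an event. It assumes, purely for illustration, that the parameter measure has finite support and is passed as a dictionary; the name `posterior_predictive` is hypothetical.

```python
def posterior_predictive(alpha, observations, event):
    """Predictive probability of `event` after the given observations.

    alpha        -- dict mapping atoms to masses (finite-support assumption)
    observations -- list of previously observed prices
    event        -- function returning True for prices inside the event
    """
    weight = sum(alpha.values())  # W, the total mass of alpha
    prior_mass = sum(m for atom, m in alpha.items() if event(atom))
    hits = sum(1 for p in observations if event(p))  # point masses in the event
    return (prior_mass + hits) / (weight + len(observations))
```

With one observation this reduces to the displayed formula: the numerator is the prior mass plus the point mass at $p$, and the denominator is $W + 1$.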


Some additional notation: $\tilde{p}_n = (p_1, \ldots, p_n)$, $\tilde{p}$ without a subscript is used for
$\tilde{p}_\infty$, $E_p(\cdot)$ denotes expectation with respect to the distribution of $p$, $E_{\tilde{p}}(\cdot)$ denotes
expectation with respect to the joint distribution of a sample $p_1, p_2, \ldots$ from $F$,
and $P_{\tilde{p}}(\cdot)$ denotes the probability measure of sets depending on $p_1, p_2, \ldots$.

The observer's problem is deciding when to stop sampling. Let $\tau$ be a stopping
time for sequences $p_1, p_2, \ldots$ of observations on $F$. The terms stopping time,
stopping rule, strategy, and procedure will be used interchangeably. Following
DeGroot (1970), without loss of generality attention can be restricted to strategies
in $A = \{\tau \mid P_{\tilde{p}}(\tau < \infty) = 1\}$. For the expected payment under the strategy $\tau$,
the notation


$V(F, \tau) = E_{\tilde{p}}(p_\tau + (\tau - 1)C)$

is used. The minimal expected payment is also known as the value of the search
problem. It is denoted

$V(F) = \inf_{\tau \in A} V(F, \tau)$.

A strategy $\tau$ in $A$ is optimal if

$V(F, \tau) = V(F)$.


The following theorem shows that an optimal strategy exists under a regularity
condition.

THEOREM. Let $p_1, p_2, p_3, \ldots$ be a sequence of identically distributed random
variables whose common distribution function is $A$. If $\mathrm{Var}_A(p_1) < \infty$ then there
exists $\tau \in A$ such that $V(F, \tau) = V(F)$.


PROOF. See DeGroot (1970, page 352). □


In particular, if the set $\mathcal{X}$, on which $A$ is defined, is a bounded set, then an
optimal stopping rule will exist.

Henceforth, any Dirichlet process referred to will be assumed to have a 
parameter a that gives a finite variance for the distribution of prices, ensuring 
the existence of an optimal rule. 

Unfortunately, the class of optimal rules is a very broad one. Let $\tau$ be a
stopping rule; we say that $\tau'$ is a version of $\tau$ if $\tau = \tau'$ almost surely (a.s.).

It is convenient when describing optimal procedures to think of $p$ as the
current observation and $F$ as the process updated by all observations except the
current one. The function $V(F|p)$ is the function $V(\cdot)$ evaluated at the process
$F|p$. Any optimal procedure dictates stopping, a.s., if


(1.1)  $p < V(F|p) + C$

and dictates continued sampling, a.s., if

(1.2)  $p > V(F|p) + C$.


With this convention, we say that it is uniquely optimal to accept $p$ if (1.1) holds
and it is uniquely optimal to reject $p$ if (1.2) holds. We say that it is optimal to
accept $p$ if

$p \le V(F|p) + C$

and it is optimal to reject $p$ if

$p \ge V(F|p) + C$.

The discussion above implies that $V$ satisfies the functional equation

$V(F) = \int \min\{p, V(F|p) + C\} \, dEF(p)$.

A truncated version of the search problem is one in which the number of price 
quotations the observer is allowed to elicit is bounded. If the values of truncated 
problems converge to the value of the untruncated problem as the number of 
price quotations increases, the results proven for the truncated problem can be 
extended to the untruncated problem. 

We need notation for the value of a truncated problem. The value of a problem
truncated at the initial step is


$V_1(F) = \int p \, dEF(p)$.


The value of a problem truncated so that an observer is allowed no more than
$T \ge 2$ price quotations is


$V_T(F) = \int \min[p, V_{T-1}(F|p) + C] \, dEF(p)$.

Yahav (1966) has shown that if $\int p^2 \, dEF(p) < \infty$ then $V_T(F) \to V(F)$.
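
The recursion for the truncated values can be evaluated directly when the parameter measure has finite support, since the integrals become finite sums. The sketch below is a minimal illustration under that assumption (a dictionary of atoms and masses); the function name `truncated_value` and the count-vector bookkeeping are hypothetical choices, not part of the original development.

```python
from functools import lru_cache


def truncated_value(alpha, cost, horizon):
    """Value V_T(F) of the problem truncated at `horizon` quotations.

    alpha   -- dict mapping prices to masses (finite-support assumption)
    cost    -- sampling cost C per additional quotation
    horizon -- maximum number T >= 1 of quotations allowed
    """
    prices = tuple(sorted(alpha))
    base = tuple(alpha[p] for p in prices)

    @lru_cache(maxsize=None)
    def value(counts, t):
        # counts[i] records how many past observations equal prices[i];
        # the updated parameter adds one unit of mass per observation.
        mass = [a + c for a, c in zip(base, counts)]
        total = sum(mass)
        probs = [m / total for m in mass]
        if t == 1:
            # Only one quotation is allowed, so it must be accepted.
            return sum(p * q for p, q in zip(prices, probs))
        out = 0.0
        for i, (p, q) in enumerate(zip(prices, probs)):
            updated = counts[:i] + (counts[i] + 1,) + counts[i + 1:]
            # Accept p, or pay C and continue with the updated process.
            out += q * min(p, value(updated, t - 1) + cost)
        return out

    return value(tuple(0 for _ in prices), horizon)
```

Allowing more quotations can only help the observer, so the computed values are nonincreasing in the horizon, in line with the convergence of $V_T$ to $V(F)$.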




In Section 2, an example and a corollary refer to the reservation price
property. This property is simply that if it is optimal at any stage to accept a
price $p_0$, then it is also optimal at that stage to accept any price $p \le p_0$. It is by
no means clear that this property holds for sequential sampling without recall
from a Dirichlet process. Rothschild (1974) proved that this property holds for
the Dirichlet multinomial sampling problem. Unfortunately, his proof does not
seem to extend to the Dirichlet process problem. Christensen (1983) contains a
proof valid for Dirichlet processes.


2. Finiteness of search. The result is that, when $\mathcal{X} = [-M, M]$, an optimal
strategy is a truncated strategy, i.e., there exists some number, say $N$, such
that an optimal procedure will not continue sampling past the $N$th stage. The
crucial result needed is a bound on the cost of continued sampling. The cost of
continued sampling is the cost of another observation, $C$, plus the expected cost
for proceeding with an optimal strategy after determining to take another
observation. Lemma 2.1 considers the interval $[-M, -M + kC/2)$ for an arbitrary
positive integer $k$. If one samples for a sufficiently long time and never gets
an observation less than $-M + kC/2$, one becomes convinced that such observations
are unlikely to occur. One essentially concludes that the best observation
one can get is $-M + kC/2$. The cost of taking another observation and then
following an optimal rule is at least $(-M + kC/2) + C$. In fact, one cannot do
quite that well because one is not absolutely sure that observations less than
$-M + kC/2$ cannot occur. In the following lemma we obtain a lower bound of
$-M + (k + 1)C/2$ for the cost of continued sampling.


LEMMA 2.1. Suppose that $F \sim \mathcal{D}(\alpha)$, and $\tilde{p}_n = (p_1, \ldots, p_n)$ is observed. For
a positive integer $k$ assume that $p_i \ge -M + kC/2$ for $i = 1, 2, \ldots, n$; then there
exists a finite $n_k$ such that if $n \ge n_k$ then the cost of continued sampling at the
$n$th stage is at least $-M + (k + 1)C/2$. The $n_k$'s can be chosen to form a
nondecreasing sequence for $k = 1, 2, \ldots$ with $n_1 = 1$.


PROOF. The proof is by induction on $k$.

(a) The initial step: Let $n_1 = 1$. At any stage (the $n$th stage is where $\tilde{p}_n$ has
been observed) the cost of continued sampling is at least $-M + C$, the cost of an
observation plus the best price that can be obtained. Since $-M + C/2 <
-M + C$, the lemma holds for $k = 1$.

(b) The inductive step: Assume that the lemma is true for $k$ and show that
it is true for $k + 1$. The hypothesis of the lemma for $k + 1$ is that $p_i \ge -M +
(k + 1)C/2$, $i = 1, 2, \ldots, n$. Clearly, it is also true that $p_i \ge -M + kC/2$, $i =
1, 2, \ldots, n$. Using the induction hypothesis, there exists $n_k$ such that if $n \ge n_k$,
the cost of continued sampling is at least $-M + (k + 1)C/2$.

A better bound is needed. Let $\beta_{n,k+1}$ be the probability that the $(n + 1)$st
observation is in the interval $[-M, -M + (k + 1)C/2)$ given that none of the
first $n$ observations were in the interval. Since sampling is from a Dirichlet
process,

$\beta_{n,k+1} = \alpha([-M, -M + (k + 1)C/2))/(W + n)$.


When sampling is continued, the cost of an observation, $C$, is always incurred.
Having obtained the observation $p_{n+1}$, two cases are considered. First, if $p_{n+1}$ is
less than $-M + (k + 1)C/2$, the least additional cost possible (regardless of
whether $p_{n+1}$ is accepted or not) is $-M$. The second case is when $p_{n+1}$ is at least
$-M + (k + 1)C/2$. If sampling stops, the cost is at least $-M + (k + 1)C/2$. If
sampling continues, the induction hypothesis applies because $p_i \ge -M +
(k + 1)C/2 > -M + kC/2$ for $i = 1, 2, \ldots, n, n + 1$ and $n + 1$ is greater than
$n_k$. Since the induction hypothesis applies, the cost of continued sampling is at
least $-M + (k + 1)C/2$. Thus, in this case, regardless of whether or not we
accept $p_{n+1}$, the cost is at least $-M + (k + 1)C/2$.

Since the probability that $p_{n+1}$ is less than $-M + (k + 1)C/2$ is $\beta_{n,k+1}$, the
cost of taking the $(n + 1)$st observation is at least

(2.1)  $C + (-M)\beta_{n,k+1} + \left[ -M + \frac{k+1}{2} C \right] (1 - \beta_{n,k+1})$.

Rewriting (2.1) gives the lower bound for the cost of continued sampling as

$-M + \frac{k+1}{2} C + \left[ 1 - \frac{k+1}{2} \beta_{n,k+1} \right] C$.

Since $\beta_{n,k+1} \to 0$ as $n \to \infty$ we can pick $n_{k+1} \ge n_k$ such that if $n \ge n_{k+1}$,
then $(1 - \beta_{n,k+1}(k + 1)/2) \ge 1/2$. Thus for $n \ge n_{k+1}$,

$-M + \frac{k+2}{2} C = -M + \frac{k+1}{2} C + \frac{C}{2} \le -M + \frac{k+1}{2} C + \left[ 1 - \frac{k+1}{2} \beta_{n,k+1} \right] C$

and so the lower bound, $-M + (k + 2)C/2$, is established, proving the lemma. □

Lemma 2.1 is not as strong a result as it may at first appear. There are many 

sequences of observations for which it does not apply. The following example 

produces one sequence of observations about which the lemma has nothing to 
say. 


EXAMPLE. Suppose $\alpha([-M, M]) = 100$ and $\alpha([-M, -M + C/2)) = 75$. The $n_k$
can be defined so that if $n \ge n_k$ then $[1 - \beta_{n,k}(k/2)] \ge 1/2$ or equivalently
$\beta_{n,k} \le 1/k$. Recall $\beta_{n,k} = \alpha([-M, -M + Ck/2))/(W + n)$, so in particular,
$\beta_{n,2} = 75/(100 + n)$ and $n_2 \ge 50$.

From Lemma 2.1 if $k = 1$, $n \ge n_1 = 1$, and $p_i \ge -M + C/2$ for $i = 1, \ldots, n$
then any value of $p_n$ between $-M + C/2$ and $-M + C$ will be accepted. By the
reservation price property any value of $p_n \le -M + C$ will be accepted. Similarly,
if $k = 2$, $n \ge n_2 \ge 50$, and $p_i \ge -M + C$ for $i = 1, \ldots, n - 1$ then any
$p_n \le -M + 3C/2$ will be accepted.

Choose $\varepsilon_1$ with $0 < \varepsilon_1 < C/2$. Let $p_1 = -M + C + \varepsilon_1$ and take $p_i >
-M + 3C/2$ for $i = 2, \ldots, n$ where $n \ge n_2$. Lemma 2.1 has nothing to say about
such a sequence of observations. The lemma applies with $k = 1$ but all the $p_i$'s
are too large to say anything about. For $k = 2$, the lemma has nothing to say
about $p_1, \ldots, p_{n_2 - 1}$ ($n_2 - 1 \ge 49$) and for $i \ge 50$ the $p_i$'s are too large to say
anything about. For $k = 3, 4, 5, \ldots$ the lemma does not apply because $p_1 <
-M + kC/2$.


Since at the $n$th stage the cost of continued sampling is $V(F|\tilde{p}_n) + C$, Lemma
2.1 implies that if $p_i \ge -M + kC/2$, $i = 1, \ldots, n$ and $n \ge n_k$ then

$-M + \frac{k+1}{2} C \le V(F|\tilde{p}_n) + C$.

If $p_n \le -M + (k + 1)C/2$, it is optimal to accept $p_n$.

The following corollary establishes that for any price, say $p$, if enough price
quotations greater than $p$ have been obtained then $p$ would be accepted if
observed.


COROLLARY 2.2. Let $p = p_n$ and assume $p_i \ge p$, $i = 1, \ldots, n - 1$. There
exists an $n^* = n^*(p)$ such that if $n \ge n^*$ then it is optimal to accept $p_n$.


PROOF. For some $k$ we have $p \in [-M + kC/2, -M + (k + 1)C/2)$. Clearly
the conditions of Lemma 2.1 are satisfied. Let $n^* = n_k$. If $n \ge n^*$, the cost of
continued sampling is at least $-M + (k + 1)C/2 > p = p_n$. □


The corollary can be strengthened as follows.


COROLLARY 2.3. Let $p_i \ge p$, $i = 1, \ldots, n - 1$ and $p_n \le p$. There exists $n^* =
n^*(p)$ such that if $n \ge n^*$, then it is optimal to accept $p_n$.


PROOF. Take $n^*$ as in Corollary 2.2. If $p_n = p$ then it is optimal to stop. The
reservation price property indicates that if $p_n < p$ it is optimal to accept $p_n$. □


We now show that all optimal search procedures are bounded with probabil- 
ity 1. 


THEOREM 2.4. Let $F \sim \mathcal{D}(\alpha)$. There exists an integer $N$ such that any
optimal procedure stops sampling no later than the $N$th stage of the search with
probability 1.


PROOF. Since every optimal procedure has a version that stops sampling
whenever $p_n \le V(F|\tilde{p}_n) + C$, it is enough to show the result for such procedures.

Suppose that no such $N$ exists; then for any $S$ there exists an optimal strategy,
$\tau$, and observations $p_1, p_2, \ldots, p_S$ for which $\tau(p_1, p_2, \ldots, p_S) > S$. That is,


$p_i \ge V(F|\tilde{p}_i) + C$, $i = 1, \ldots, S$.

It will be shown that for large enough $S$, $p_S > M$, which is a contradiction to
$\mathcal{X} = [-M, M]$. An inductive argument is used to show that in order to have a
sequence of prices that avoids stopping, the prices must get systematically larger.
Eventually they exceed $M$. An upper bound $N$ is found as a number such that if
$S \ge N$, the contradiction results.

(a) The initial step. $-M + C$ is always a lower bound for the cost of
continued sampling, so any price less than $-M + C$ would be accepted. Therefore,
we can assume $p_i \ge -M + C$, $i = 1, 2, \ldots, S$.

(b) The inductive step. We want to show that if the sequence $p_1, \ldots, p_S$ never
stops and if the observations are eventually above some constant, then the
observations must eventually be larger than another, larger, constant. In particular,
suppose that $p_i \ge -M + kC/2$ for $i = N_k, N_k + 1, \ldots, S$; then there exists
$N_{k+1} \ge N_k$ so that $p_i \ge -M + (k + 1)C/2$ for $i = N_{k+1}, N_{k+1} + 1, \ldots, S$ if
$S \ge N_{k+1}$. Note that for $k = 1$, $N_1 = 1$.

After $N_k - 1$ observations have been obtained, the updated Dirichlet process
is


$F|\tilde{p}_{N_k - 1} \sim \mathcal{D}\left( \alpha(\cdot) + \sum_{i=1}^{N_k - 1} \delta_{p_i}(\cdot) \right)$.

Lemma 2.1 can be applied to the updated Dirichlet process. The lemma says that
since $p_{i + N_k - 1} \ge -M + kC/2$ for $i = 1, 2, \ldots, S - N_k + 1$ there exists $n_k$
(depending on $\alpha(\cdot) + \sum_{i=1}^{N_k - 1} \delta_{p_i}(\cdot)$) such that if $n \ge n_k$ the cost of continued
sampling when $\tilde{p}_{n + N_k - 1}$ has been observed is at least $-M + (k + 1)C/2$.

Ignoring, for the moment, that $n_k$ depends on $p_1, \ldots, p_{N_k - 1}$ we complete the
inductive argument. If $S - N_k + 1 \ge n_k$ then for $i = n_k, n_k + 1, \ldots, S - N_k + 1$ it
is optimal to stop for any $p_{i + N_k - 1} \le -M + (k + 1)C/2$. Since $p_1, \ldots, p_S$ does
not stop, it must be that $p_{i + N_k - 1} \ge -M + (k + 1)C/2$ for $i = n_k, n_k + 1, \ldots,
S - N_k + 1$. Rewriting, we must have $p_i \ge -M + (k + 1)C/2$ for $i = N_k +
n_k - 1, N_k + n_k, \ldots, S$. Letting $N_{k+1} = N_k + n_k - 1$ the inductive step is proven.

Since $N$ will later be chosen as a function of the $N_k$'s it will not do to have
$N_{k+1}$ depend on $p_1, \ldots, p_{N_k - 1}$. In the proof of Lemma 2.1, $n_k$ was taken so that
for $n \ge n_k$, $(1 - k\beta_{n,k}/2) \ge 1/2$ where


$\beta_{n,k} = \frac{\alpha([-M, -M + kC/2))}{W + n}$.

This condition is equivalent to picking $n_k$ so that for $n \ge n_k$, $\beta_{n,k} \le 1/k$. In the
current instance we are applying the lemma to $F|\tilde{p}_{N_k - 1}$. If we pick $n_k$ so that
for $n \ge n_k$

$\frac{\alpha([-M, -M + kC/2)) + N_k - 1}{W + N_k - 1 + n} \le \frac{1}{k}$

then we automatically have

$\frac{\alpha([-M, -M + kC/2)) + \sum_{i=1}^{N_k - 1} \delta_{p_i}([-M, -M + kC/2))}{W + N_k - 1 + n} \le \frac{1}{k}$

and now $n_k$ does not depend on $p_1, \ldots, p_{N_k - 1}$.

(c) The contradiction. Let $N = N_{k_0}$, where $k_0 = [4M/C] + 1$. ($[a]$ is the
greatest integer in $a$.) By the induction result, if $S \ge N$ and stopping has not
occurred for $p_1, \ldots, p_S$ then


$p_i \ge -M + (k_0/2)C$ for $i = N, N + 1, \ldots, S$.


Thus $p_S \ge -M + ([4M/C] + 1)C/2 > -M + (4M/C)C/2 = M$, but $p_S > M$ is
a contradiction; therefore $p_1, \ldots, p_S$ must have stopped previous to the $N$th
stage. □


Although a closed form solution for $N$ is not available, computation of $N$ is
easy: $N_1 = 1$, $N_{k+1} = N_k + n_k - 1$, and $n_k$ can be taken as the greatest integer
in


$k\,\alpha([-M, -M + kC/2)) + (k - 1)(N_k - 1) - W + 1$.
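
The recursion for $N$ can be sketched in a few lines. The code below uses the expression for $n_k$ as it follows from the inequality in part (b) of the proof, namely the greatest integer in $k\alpha([-M, -M + kC/2)) + (k-1)(N_k - 1) - W + 1$. Two points are assumptions of this sketch rather than statements of the paper: $n_k$ is clamped below at 1 (consistent with $n_1 = 1$ in Lemma 2.1), and the caller supplies the interval masses through a hypothetical function `alpha_mass(k)`.

```python
import math


def stopping_bound(alpha_mass, total_mass, M, C):
    """Return (N, steps) where N = N_{k0} and k0 = [4M/C] + 1.

    alpha_mass -- function k -> alpha([-M, -M + k*C/2)) (hypothetical helper)
    total_mass -- W, the total mass of alpha on [-M, M]
    steps      -- list of (k, n_k, N_k) triples, kept for inspection
    """
    k0 = math.floor(4 * M / C) + 1
    N = 1  # N_1 = 1
    steps = []
    for k in range(1, k0):
        raw = k * alpha_mass(k) + (k - 1) * (N - 1) - total_mass + 1
        n_k = max(1, math.floor(raw))  # clamp at 1: an assumption of this sketch
        steps.append((k, n_k, N))
        N = N + n_k - 1  # N_{k+1} = N_k + n_k - 1
    return N, steps
```

As a usage example under a hypothetical uniform parameter measure of total mass $W = 2$ on $[-1, 1]$ with $C = 1$, one can pass `lambda k: 2.0 * min(1.0, (k * 0.5) / 2.0)` as `alpha_mass`.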


Acknowledgment. The author would like to thank Donald A. Berry for his
assistance with this work, especially with the proof of Lemma 2.1.


REFERENCES 


CHOW, Y. S., ROBBINS, H., and SIEGMUND, D. (1971). Great Expectations: The Theory of Optimal
Stopping. Houghton Mifflin, Boston.

CHRISTENSEN, R. (1983). Searching for the lowest prices when the unknown distribution of prices is
modeled with a Dirichlet process. Ph.D. Thesis, University of Minnesota, unpublished.

CLAYTON, M. K. (1985). A Bayesian nonparametric sequential test for the mean of a population.
Ann. Statist. 13 1129-1139.

DEGROOT, M. H. (1968). Some problems of optimal stopping. J. Roy. Statist. Soc. B 30 108-112.

DEGROOT, M. H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.

FERGUSON, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1
209-230.

FERGUSON, T. S. (1974). Prior distributions on spaces of probability measures. Ann. Statist. 2
615-629.

ROTHSCHILD, M. (1974). Searching for the lowest price when the distribution of prices is unknown. J.
Pol. Econ. 82 689-711.

YAHAV, J. A. (1966). On optimal stopping. Ann. Math. Statist. 37 30-35.


DEPARTMENT OF MATHEMATICAL SCIENCES 
MONTANA STATE UNIVERSITY 
BOZEMAN, MONTANA 59717


The Annals of Statistics
1986, Vol. 14, No. 1, 283-297


BAYESIAN MULTISTAGE DECISION PROBLEMS 


By JEFFREY A. WITMER 
University of Florida 


Two treatments that yield Bernoulli outcomes are available in a clinical
trial. One success probability is known. A probability distribution reflects
opinion about the other success rate. N patients are to be treated, with N
possibly unknown, in a multistage trial. The goal is to maximize the total
number of successes on the N patients.

Optimal lengths for each stage and optimal treatment allocations are
found for two-stage trials with N known.

When N is unknown the problem is shown to be equivalent to that of
discounting future observations. Optimal stage lengths and treatment allocations
are characterized for distributions on N that yield regular discount
sequences. This class of distributions includes the geometric family, which is
given special consideration.

It is shown that if the number of stages in the trial is fixed and if the
distribution on N yields a regular discount sequence, then it is optimal to use
the known treatment in the last stage only. This extends the work of Berry
and Fristedt (1979).


1. Introduction. Experimenters are often interested in comparing two treat- 
ments that yield dichotomous response. For convenience, we consider the setting 
of a clinical trial in which patients receive one of two medical treatments. The 
results obtained herein apply to other types of trials as well. 

Within the clinical trial, application of a treatment results in either recovery 
(success) or no recovery (failure). The goal of the trial is to maximize the total 
number of successes in N patients, where N may be unknown. This is equivalent 
to maximizing the total utility of the experiment when a success has utility 1 and 
a failure has utility 0. 

For each of the $N$ patients in the trial the experimenter must choose one of
two treatments to be applied. Treatment 1 has success probability $\theta_1$ and
treatment 2 has unknown success rate $\theta_2$. We assume that treatment 1 is a
well-known standard already in use. We may wish to model $\theta_1$ as being unknown,
as the “well-known” standard may produce surprises in the current trial. However,
for simplicity we assume throughout that $\theta_1$ is known.

Opinion about treatment 2 available separate from the trial is expressed by a
prior distribution on $\theta_2$. The outcomes are assumed to be Bernoulli random
variables. When treatment 1 is applied the random variables are indexed by $\theta_1$
and are independent. When treatment 2 is applied the random variables are
indexed by $\theta_2$, are exchangeable, and are independent of the variables indexed by
$\theta_1$. The patients are viewed as being exchangeable for the purpose of the trial.

The outcome for a patient is frequently assumed to be known before the next
patient is treated. Petkau (1978), Upton and Lee (1976), and Berry (1972, 1978),
among others, have considered sequential allocation of Bernoulli processes. Of
particular interest is Berry and Fristedt (1979).

While mathematically appealing, these procedures may be too cumbersome to 
be used in practice. They require that the patients respond almost immediately 
to treatment and, often, that the experimenter make a continuing series of 
calculations and decisions as the trial progresses. Of course, the availability of 
microcomputers lessens the second obstacle. 

A more realistic assumption is that the data are collected at intervals 
throughout the trial. Calculations can be made at these times and the future 
course of the trial can be altered. Such a trial consists of several stages. While it 
is possible that only partial information is available from one stage to the next, 
the setting considered here is that in which all of the patients from the previous 
stage respond before the experimenter decides on an allocation scheme for the 
next stage. If the number of patients in each stage is 1 then we are in the classical 
sequential setting. 

The problem of maximizing the total number of successes when an experiment 
takes place in two or more stages has received little attention. Cornfield, Halperin, 
and Greenhouse (1969), Colton (1963), and Donner (1979) discussed this problem 
when the outcomes are normal random variables. Canner (1970) considered 
Bernoulli processes in a multistage setting with both treatments unknown, as did 
Pearson (1980). Pearson (1980) also considered the case in which one treatment is 
known. However, he required N to be fixed. 

We now discuss the notion of a strategy. The number of stages in the trial is
chosen before the trial begins and is not part of a strategy. A strategy, often
denoted by $\tau$, specifies $K(\tau)$, the length of the first stage. Denote by $K_i(\tau)$ the
number of planned first-stage observations on treatment $i$, $i = 1$ or 2. Often we
abbreviate $K_i(\tau)$ to Ki and $K(\tau)$ to K. Then K = K1 + K2. As well, $\tau$ specifies
$t_1, t_2, \ldots, t_K$, where $t_i$ is 1 or 2 according as treatment 1 or 2 is assigned to the $i$th
patient. Then it specifies the length and treatment allocation scheme for the
second stage, in the same manner as for the first stage, for each possible outcome
of the first stage, and so on.

The length and treatment allocation scheme for a particular stage are functions
of the success rate $\theta_1$, the number of stages remaining, and the current
distributions on $\theta_2$ and $N$. The results from previous stages are reflected in the
current distribution of $\theta_2$. The quantities $N$ and $\theta_2$ are assumed to be independent.

The utility of a strategy is the average utility of the histories possible when
following that strategy; the average is with respect to $\theta_1$ and the initial distributions
of $\theta_2$ and $N$. An optimal strategy is one that yields maximal expected
utility.

An example of a two-stage strategy can be described as follows. Take exactly
one first-stage observation on treatment 2. Thus K = 1 and $t_1 = 2$. If a success is
obtained use only treatment 2 for the second stage. That is, set $t_2 = t_3 = \cdots = 2$.
If a failure is observed use only treatment 1 in the second stage ($t_2 = t_3 = \cdots = 1$).

For each ordered pair (K1, K2) there are various possible allocation vectors
$(t_1, \ldots, t_K)$. Also, there are many continuations for each possible outcome of the
first stage. Thus, many strategies share a common (K1, K2).

For fixed $\theta_1$, $\Pi$, and $Q$, let $U((K1, K2), \theta_1, \Pi, Q)$ denote the maximal expected
utility over all strategies with given (K1, K2). Let


$U^*(\theta_1, \Pi, Q) = \sup_{(K1, K2)} U((K1, K2), \theta_1, \Pi, Q)$.


Optimal strategies have utility $U^*(\theta_1, \Pi, Q)$.

It may be that more than one ordered pair (K1, K2) is optimal. Let K1*
denote the smallest value of K1 that is optimal. That is, K1* is the least value of
K1 among all (K1, K2) that satisfy


(1)  $U((K1, K2), \theta_1, \Pi, Q) = U^*(\theta_1, \Pi, Q)$.


Similarly, denote by K* the smallest value of K1* + K2 among all (K1*, K2)
that satisfy (1). Of course, K* may sometimes be zero, in which case the first
stage contains no observations. Let K2* = K* - K1*.

Throughout the remainder of the discussion we denote $E(\theta_2|\Pi)$ by $\mu$ and the
number of successes from the K2 first-stage observations on treatment 2 by $S$. In
Section 2 we consider two-stage trials in which the total number of patients, $N$, is
known. In Section 3 we discuss distributions on $N$ and develop their relation to
the discounting of future observations. Special attention is given to a class of
discount sequences that are called regular. The main results, which extend the
work of Berry and Fristedt (1979), are in Section 4, where we consider two-stage
and higher-stage trials in which $N$ is unknown.


2. Known trial length with two stages. In this section we consider the
case in which $N$, the total number of patients to be treated, is known and the
trial has two stages. We modify our notation by writing $U((K1, K2), \theta_1, \Pi, N)$ in
place of $U((K1, K2), \theta_1, \Pi, Q)$.

Our goal is to maximize the total number of successes on the N patients. Given 
the results of the first stage, the optimal conditional strategy for the second stage 
clearly is to use the treatment with larger posterior mean on all remaining 
patients. Thus, our problem is to choose K and to assign treatments to each of 
the first K patients. 

When $N$ is known we may, without loss, take all observations on treatment 1
before any observations on treatment 2, i.e., $(t_1, \ldots, t_K)$ is of the form
$(1, 1, \ldots, 1, 2, 2, \ldots, 2)$. Thus, the ordered pair (K1, K2) uniquely identifies a
strategy. It is understood that an optimal continuation will be used for the
second stage.

The advantage of a two-stage procedure over a one-stage procedure is that
with a two-stage procedure we can learn about the treatments during the first
stage and apply our knowledge during the second. Since $\theta_1$ is known it seems that
no first-stage observations on treatment 1 should be necessary. Theorem 1, which
is similar to a result by Pearson (1980), confirms this.


THEOREM 1. Let $N$ be known. For all $\theta_1$ and for any prior, $\Pi$, on $\theta_2$,
K1* = 0.

COMMENT. K1 = 0 may not be uniquely optimal. For example, if
$P(\theta_2 < \theta_1) = 1$ then treatment 2 should never be used, and any value of K1 such
that $0 \le K1 \le N$ is optimal.

PROOF. Note that

$P(S = j) = E\,P(S = j|\theta_2) = \binom{K2}{j} \int \theta^j (1 - \theta)^{K2 - j} \, d\Pi(\theta)$

and that the support of $S$ is $0, 1, 2, \ldots, K2$. Let $K1 + K2 \le N$. Then

$U((K1, K2), \theta_1, \Pi, N) = K1 \cdot \theta_1 + K2 \cdot \mu + (N - K1 - K2) \sum_{j=0}^{K2} \max\{\theta_1, E(\theta_2|S = j)\} P(S = j)$

$= U((0, K2), \theta_1, \Pi, N) + K1 \cdot \theta_1 - K1 \left[ \sum_{j=0}^{K2} \max\{\theta_1, E(\theta_2|S = j)\} P(S = j) \right]$

$\le U((0, K2), \theta_1, \Pi, N) + K1 \cdot \theta_1 - K1 \left[ \sum_{j=0}^{K2} \theta_1 P(S = j) \right]$

$= U((0, K2), \theta_1, \Pi, N)$. □


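Consider the case in which $\Pi$ is a uniform distribution on the interval (0, 1).
Here $P(S = j|\Pi) = 1/(K + 1)$ for $j = 0, 1, \ldots, K$. Also,

$E(\theta_2|\Pi) = \frac{1}{2}$,  $E(\theta_2|S = j, \Pi) = \frac{j + 1}{K + 2}$.

Thus

(2)  $U(K, \theta_1, \Pi, N) = \frac{K}{2} + \frac{N - K}{K + 1} \sum_{j=0}^{K} \max\left\{ \theta_1, \frac{j + 1}{K + 2} \right\}$.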
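
Expression (2) is simple enough to evaluate exactly for every first-stage length, which gives a direct check on any closed-form approximation to the optimal $K$. The sketch below does this for the uniform prior and also evaluates the approximation $[(N+1)(\theta_1^{-1} - 1)]^{1/2} - 1$ attributed to Pearson (1980) in this section. The function names are hypothetical and the brute-force search is an illustration, not a method proposed in the paper.

```python
import math


def utility_uniform(K, theta1, N):
    """Expected successes from (2): K first-stage patients on treatment 2,
    uniform prior on theta_2, best treatment used on the remaining N - K."""
    total = sum(max(theta1, (j + 1) / (K + 2)) for j in range(K + 1))
    return K / 2 + (N - K) / (K + 1) * total


def best_first_stage(theta1, N):
    """Smallest K in 0..N maximizing (2); max() keeps the first maximizer."""
    return max(range(N + 1), key=lambda K: utility_uniform(K, theta1, N))


def pearson_approx(theta1, N):
    """Closed-form approximation to the optimal first-stage length."""
    return math.sqrt((N + 1) * (1 / theta1 - 1)) - 1
```

Comparing `best_first_stage(theta1, N)` with `pearson_approx(theta1, N)` over a grid of values is one way to see how close the approximation is for a given trial size.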





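Pearson (1980) derived the approximation
$$K^* = [(N + 1)(\theta_1^{-1} - 1)]^{1/2} - 1$$
as follows. Let $[x]$ denote the greatest integer less than or equal to x. Then
$$\begin{aligned}
U(K, \theta_1, \Pi, N) &= \frac{K}{2} + \frac{N - K}{K + 1} \Big\{ \sum_{j=0}^{[\theta_1 (K + 2)] - 1} \theta_1 + \sum_{j=[\theta_1 (K + 2)]}^{K} \frac{j + 1}{K + 2} \Big\} \\
&\approx \frac{K}{2} + \frac{N - K}{K + 1} \Big[ \theta_1^2 \, \frac{K + 2}{2} - \frac{\theta_1}{2} + \frac{K + 1}{2} \Big] \\
&= \tfrac{1}{2} \big\{ N - \theta_1 (N - K)(K + 1)^{-1} + \theta_1^2 (K + 2)(N - K)(K + 1)^{-1} \big\} \\
&= f(K, \theta_1, N), \text{ say,}
\end{aligned}$$
where the second line replaces $[\theta_1 (K + 2)]$ by $\theta_1 (K + 2)$.

Setting $\frac{d}{dK} f(K, \theta_1, N) = 0$ gives the desired result.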
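The approximation can be compared numerically with the exact utility (2). The following is a minimal sketch, assuming a uniform(0, 1) prior as in the text; the function names and the values N = 100 and theta1 = 1/2 are illustrative choices, not taken from the paper.

```python
from fractions import Fraction
from math import sqrt


def exact_utility(K, theta1, N):
    """Equation (2): utility of K first-stage observations on treatment 2
    under a uniform(0, 1) prior, with N known."""
    total = sum(max(theta1, Fraction(j + 1, K + 2)) for j in range(K + 1))
    return Fraction(K, 2) + Fraction(N - K, K + 1) * total


def pearson_k_star(theta1, N):
    """Approximate maximizer [(N + 1)(1/theta1 - 1)]^(1/2) - 1."""
    return sqrt(float((N + 1) * (1 / theta1 - 1))) - 1


# Illustrative (assumed) inputs.
theta1, N = Fraction(1, 2), 100
best_K = max(range(N + 1), key=lambda K: exact_utility(K, theta1, N))
approx = pearson_k_star(theta1, N)
print("exact maximizer:", best_K, "approximation:", round(approx, 2))
```

Exact rational arithmetic is used so that the search over K is not affected by rounding; the exact maximizer need not coincide with the rounded approximation because (2) involves an integer part.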








3. Distributions on N and discounted future. In Section 4 we consider
trials in which N is unknown. The purpose of this section is to explore the 
relationship between distributions on N and the commonly considered approach 
of discounting future observations. 

It is often for computational convenience that N is assumed to be known. In 
applications it is unlikely that N is known precisely: patients may drop out of 
the trial, new treatments may be discovered, etc. We model the case in which N 
is unknown by assuming that it is a random variable with known distribution Q 
on the positive integers. 

Let $\tau_m$ denote the observation on patient m when following strategy $\tau$. When
N is known, the utility, $u(\tau \mid N)$, of strategy $\tau$ is the expected sum of the
observations:

$$u(\tau \mid N) = E \sum_{m=1}^{N} \tau_m. \tag{3}$$

Suppose that N has distribution Q, under which $P(N = i) = p_i$. Let
$\alpha_n = \sum_{i=n}^{\infty} p_i$, the probability that N is greater than or equal to n. We assume that all
$\alpha_n$ are independent of all $\tau_n$. That is, the probability that the trial ends at a given
time is independent of the results obtained on the patients. A more realistic,
although more difficult to model, assumption is that the results obtained during
the trial affect the decision of when to terminate the trial by, for example,
affecting the rate of development of other treatments.


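THEOREM 2. The sequence $A = (\alpha_1, \alpha_2, \ldots)$, which is nonincreasing, is a
discount sequence for the trial.


PROOF. For any strategy $\tau$, with N distributed as Q,
$$u(\tau \mid Q) = \sum_{n=1}^{\infty} u(\tau \mid N = n, Q) P(N = n \mid Q) = \sum_{n=1}^{\infty} P(N = n \mid Q) \Big[ \sum_{m=1}^{n} E(\tau_m \mid N = n, Q) \Big].$$

The strategy $\tau$ depends on Q. Thus $E(\tau_m \mid N = n, Q)$ depends on Q but not on
N = n. Hence
$$\begin{aligned}
u(\tau \mid Q) &= p_1 E(\tau_1 \mid Q) + \cdots + p_n [E(\tau_1 \mid Q) + \cdots + E(\tau_n \mid Q)] + \cdots \\
&= E(\tau_1 \mid Q)(p_1 + p_2 + \cdots) + E(\tau_2 \mid Q)(p_2 + p_3 + \cdots) + \cdots + E(\tau_m \mid Q)(p_m + p_{m+1} + \cdots) + \cdots \\
&= E(\tau_1 \mid Q)\alpha_1 + \cdots + E(\tau_m \mid Q)\alpha_m + \cdots.
\end{aligned}$$
Thus
$$u(\tau \mid Q) = E \sum_{n=1}^{\infty} \alpha_n \tau_n = \sum_{n=1}^{\infty} \alpha_n E\tau_n, \tag{4}$$

suppressing the dependence of $\tau$ on Q. $\Box$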
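
The rearrangement in the proof can be checked on a small case. The sketch below is illustrative only: the distribution on N and the per-patient expectations are made-up inputs, and the function names are hypothetical. It computes the utility both as a mixture over N and as a discounted sum, as in (4).

```python
from fractions import Fraction


def discount_sequence(pmf):
    """alpha_n = P(N >= n) for n = 1..len(pmf), where pmf[i - 1] = P(N = i)."""
    alphas, tail = [], Fraction(0)
    for p in reversed(pmf):
        tail += p
        alphas.append(tail)
    return alphas[::-1]


def utility_by_mixture(pmf, means):
    """Sum over n of P(N = n) * (E tau_1 + ... + E tau_n)."""
    return sum(p * sum(means[:n]) for n, p in enumerate(pmf, start=1))


def utility_by_discount(pmf, means):
    """Sum over n of alpha_n * E tau_n, as in equation (4)."""
    return sum(a * m for a, m in zip(discount_sequence(pmf), means))


# Assumed inputs: N takes values 1..4; means[n - 1] stands in for E(tau_n | Q).
pmf = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
means = [Fraction(3, 5), Fraction(1, 2), Fraction(2, 3), Fraction(7, 10)]
print(discount_sequence(pmf))
print(utility_by_mixture(pmf, means) == utility_by_discount(pmf, means))
```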
The discount sequence A assigns to each potential observation in the trial
weight equal to the chance that it is indeed observed.


EXAMPLE 1. Suppose Q is a one-point distribution at $N_1$, i.e., N is known to
be $N_1$. Then $\alpha_1 = \alpha_2 = \cdots = \alpha_{N_1} = 1$ and $\alpha_{N_1 + 1} = \alpha_{N_1 + 2} = \cdots = 0$. Thus

$$A = (1, 1, \ldots, 1, 0, 0, \ldots) \tag{5}$$
and for any strategy $\tau$,

$$u(\tau) = E \sum_{m=1}^{\infty} \alpha_m \tau_m = E \sum_{m=1}^{N_1} \tau_m.$$


All discount sequences generated by distributions on N can be expressed as 
mixtures of discount sequences of form (5). 

Only the first N patients in a sequence of possible patients are treated in the 
trial. Note that observations are not actually discounted as they are observed. N, 
however, is unknown (before the trial begins) and, hence, in finding an optimal 
strategy we discount future observations by fixed factors $\alpha_1, \alpha_2, \ldots$.

Although we learn about N as the trial progresses, this learning does not
change the nature of the problem. The conditional distribution of N at any given
time is proportional to Q, the original distribution. Given that N > m, our goal
at time m is to maximize
$$E \sum_{n=m+1}^{\infty} \beta_n \tau_n$$
where
$$\beta_n = P(N \ge n \mid N > m, Q) = \frac{P(N \ge n \mid Q)}{P(N > m \mid Q)} = \frac{\alpha_n}{\alpha_{m+1}}.$$
But this is equivalent to maximizing
$$E \sum_{n=m+1}^{\infty} \alpha_n \tau_n.$$

The problem of finding optimal strategies can be greatly simplified for a large
class of distributions on N. This is the class of distributions that are regular. We
define a distribution Q to be regular if it yields a regular discount sequence, as
defined by Berry and Fristedt (1979).


DEFINITION. A discount sequence $A = (\alpha_1, \alpha_2, \ldots)$ is regular if, for each m,
$$\gamma_m \gamma_{m+2} \le \gamma_{m+1}^2,$$
where $\gamma_m = \sum_{n=m}^{\infty} \alpha_n$.


Note that all geometric discount sequences are regular, but just barely:
$\gamma_{m+1}^2 = \gamma_m \gamma_{m+2}$ for all m.
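
For a discount sequence with finitely many nonzero terms, the definition can be checked directly. The following is a small sketch under that finite-support assumption; the helper names are hypothetical. It is applied to a known N = 5 (the form (5)) and to the two-point distribution with P(N = 1) = 0.9 and P(N = 10) = 0.1.

```python
from fractions import Fraction


def gammas(alphas):
    """gamma_m = alpha_m + alpha_{m+1} + ... for a finitely supported sequence.
    Two trailing zeros are appended so that gamma_{m+2} is always available."""
    out, tail = [], Fraction(0)
    for a in reversed(alphas):
        tail += a
        out.append(tail)
    return out[::-1] + [Fraction(0), Fraction(0)]


def is_regular(alphas):
    """Check gamma_m * gamma_{m+2} <= gamma_{m+1}^2 for every m."""
    g = gammas(alphas)
    return all(g[m] * g[m + 2] <= g[m + 1] ** 2 for m in range(len(g) - 2))


known_n = [1] * 5                            # N known to be 5
two_point = [1] + [Fraction(1, 10)] * 9      # P(N = 1) = 0.9, P(N = 10) = 0.1
print(is_regular(known_n), is_regular(two_point))
```

For the two-point sequence the condition already fails at m = 1, since $\gamma_1 \gamma_3 = 1.9 \times 0.8$ exceeds $\gamma_2^2 = 0.81$.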
4. Trials with unknown N. In this section we consider the case in which N
is unknown. We consider the special case of a geometric distribution on N
separately before developing several theorems that lead to the main result on
regular distributions. This main result (Theorem 7) extends the sequential work
of Berry and Fristedt (1979).

If $\theta_1 > E\theta_2$ we are faced with two conflicting goals. We wish to maximize the
probability of success on the current patient. This requires that we use treatment
1. On the other hand, we want to explore the possibility that $\theta_2 > \theta_1$. To gather
information about $\theta_2$ we can assign treatment 2 to the current patient.

Although it may benefit future patients, use of treatment 2 when $\theta_1 > E\theta_2$
may be inconsistent with medical ethics. We shall avoid the issue of medical 
ethics and proceed with the understanding that some combination of “the good 
of the current patient” and “the good of the whole” is appropriate. Such a 
combination is consistent with, and can be reflected in, the discount sequence A 
discussed in the previous section. 

Each distribution Q yields a unique discount sequence A(Q). If A(Q) is of the
form $(\alpha_1, \ldots, \alpha_n, 0, 0, \ldots)$, then only a finite number of strategies are possible;
hence there exists an optimal strategy. Theorem 3 shows that there exists an
optimal strategy for any discount sequence.


THEOREM 3. For any distributions $\Pi$ and Q, for any $\theta_1$, and for any (fixed)
number of stages there exists an optimal strategy.


NOTE. This result is very similar to Lemma 1.1 in Berry and Fristedt (1979).
We adapt their proof.


PROOF. Let D denote the number of stages in the trial. As always, D is fixed
before the trial begins.

Suppose that Q yields discount sequence $(\alpha_1, \alpha_2, \ldots)$. Further, suppose that,
for each n, $\tau^n$ is an optimal D-stage strategy for $\theta_1$, $\Pi$, and the discount
sequence $(\alpha_1, \ldots, \alpha_n, 0, \ldots)$. There is a strategy $\tau$ that through any term m agrees
with at least one $\tau^n$. We construct $\tau$ recursively as follows: At term m define $\tau$ so
that it agrees with infinitely many of the $\tau^n$ through term m. The utility of $\tau$ is
$$u(\tau \mid Q) = \sum_{i=1}^{\infty} \alpha_i E\tau_i = \lim_{m \to \infty} \sum_{i=1}^{m} \alpha_i E\tau_i = \lim_{m \to \infty} \sum_{i=1}^{m} \alpha_i E\tau_i^m.$$

Let $\hat\tau$ be any other strategy. Then
$$u(\hat\tau \mid Q) = \sum_{i=1}^{\infty} \alpha_i E\hat\tau_i = \lim_{m \to \infty} \sum_{i=1}^{m} \alpha_i E\hat\tau_i \le \lim_{m \to \infty} \sum_{i=1}^{m} \alpha_i E\tau_i^m = u(\tau \mid Q)$$
since
$$\sum_{i=1}^{m} \alpha_i E\hat\tau_i \le \sum_{i=1}^{m} \alpha_i E\tau_i^m$$
by the definition of $\tau^m$. Thus $\tau$ is an optimal strategy for $\theta_1$, $\Pi$, and Q. $\Box$


In Section 2 we considered N to be known and concluded that K1* = 0. This 
is not the case when N is unknown, as the following example shows. 


EXAMPLE 2. Suppose Q(1) = 0.9, Q(10) = 0.1, $\theta_1 = 0.6$, and $\theta_2 \sim$
uniform(0, 1). Further suppose that when K1 > 0 we take all K1 observations on
treatment 1 before any of the K2 observations on treatment 2; i.e., $(t_1, \ldots, t_K)$ is
of the form (1, ..., 1, 2, ..., 2). This will be shown in Theorem 5 to characterize
optimal strategies. Then U(K1, 0) = (0.9)[(0.6)(1)] + (0.1)[(0.6)(10)] = 1.14
for K1 = 0, 1, ..., 10, and U(1, 1) = 1.156. Similar calculations show that
U(K1, K2) < U(1, 1) for all (K1, K2) $\ne$ (1, 1). Thus, the ordered pair (1, 1) is
optimal and K1* > 0.
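
The calculations in Example 2 can be reproduced by enumerating all pairs (K1, K2). The sketch below assumes the setup of the example (uniform prior, treatment 1 first, optimal second stage) and writes the utility as a discounted sum with $\alpha_n = P(N \ge n)$; the function names are hypothetical.

```python
from fractions import Fraction


def H(K2, theta1):
    """Expected per-patient success rate in the second stage after K2 first-stage
    observations on treatment 2, for a uniform(0, 1) prior on theta2:
    P(S = j) = 1/(K2 + 1) and E(theta2 | S = j) = (j + 1)/(K2 + 2)."""
    return sum(max(theta1, Fraction(j + 1, K2 + 2)) for j in range(K2 + 1)) / (K2 + 1)


def utility(K1, K2, theta1, alphas):
    """Sum over patients n of alpha_n times the expected response of patient n:
    treatment 1 first, then treatment 2, then the better-looking treatment."""
    mu = Fraction(1, 2)                      # prior mean of theta2
    total = Fraction(0)
    for n, a in enumerate(alphas, start=1):
        if n <= K1:
            total += a * theta1
        elif n <= K1 + K2:
            total += a * mu
        else:
            total += a * H(K2, theta1)
    return total


theta1 = Fraction(3, 5)
alphas = [Fraction(1)] + [Fraction(1, 10)] * 9   # P(N >= n) for Q(1) = 0.9, Q(10) = 0.1
pairs = [(k1, k2) for k1 in range(11) for k2 in range(11 - k1)]
best = max(pairs, key=lambda pr: utility(pr[0], pr[1], theta1, alphas))
print(best, float(utility(1, 1, theta1, alphas)))
```

In exact arithmetic U(1, 1) = 347/300, which agrees with the value 1.156 quoted in the example to three decimals.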


Nevertheless, we can often set K1 equal to zero. 


THEOREM 4. For any prior distributions $\Pi$ on $\theta_2$ and Q on N, if $\theta_1 \le \mu$ then
K1* = 0.


PROOF. Suppose $K1(\tau) > 0$ for some strategy $\tau$. Consider an alternative
strategy $\tau'$ derived from $\tau$ as follows: Set $K(\tau') = K(\tau)$, $K1(\tau') = 0$, and
$K2(\tau') = K(\tau')$. Then under strategy $\tau'$, $t_i = 2$ for $i \le K(\tau')$.

The expected utility from the first stage when following $\tau'$ is no less than the
expected utility from the first stage when following $\tau$, since $\theta_1 \le \mu$. The same
relationship holds for the second stage, as may be seen by noting that

$$E_{X_{j+1}} \big[ \max\{\theta_1, E(\theta_2 \mid X_{j+1})\} \mid X_j \big] \ge \max\{\theta_1, E(\theta_2 \mid X_j)\},$$

where $X_j$ is the random vector of responses for the first j observations on
treatment 2. Thus $u(\tau) \le u(\tau')$ and the result follows. $\Box$


Clearly, when K1 = 0 the order of treatment assignments in the first stage is 
not an issue. When N is unknown and both K1 > 0 and K2 > 0, however, the 
order of the first-stage observations can be important. 

When $\theta_1 \le \mu$, K1* = 0 by Theorem 4. When $\theta_1 > \mu$ we want all first-stage
observations on treatment 1 to precede any first-stage observations on treatment
2. Theorem 5, which was used in Example 2, makes this idea precise. This
theorem states that for any prior distributions $\Pi$ on $\theta_2$ and Q on N, if $\theta_1 > \mu$
then it is optimal to take any first-stage observations on treatment 1 before any
and all first-stage observations on treatment 2.

THEOREM 5. For any prior distributions $\Pi$ on $\theta_2$ and Q on N, if $\theta_1 > \mu$ then
there is an optimal strategy of the form: $t_1 = \cdots = t_{K1} = 1$, $t_{K1+1} = \cdots = t_K = 2$.


PROOF. Suppose strategy $\tau$ is such that $t_i = 2$ and $t_{i+j} = 1$ for some i, j with
$i + j \le K(\tau)$. Modify $\tau$ to $\tau'$ only as follows: Set $t_i = 1$ and $t_{i+j} = 2$ under $\tau'$.
Then $u(\tau') - u(\tau) = (\alpha_i - \alpha_{i+j})(\theta_1 - \mu) \ge 0$. Proceed to exchange all such
deviant pairs. $\Box$


Suppose that N has a geometric distribution with mean 1/p (p is known). As 
noted earlier, geometric distributions are regular. We will see that the case 
N ~ G(p) behaves much like that in which N is known and equal to [1/p], the 
greatest integer in 1/p, provided 1/p is large (1/p > 20, say). 


THEOREM 6. Let $N \sim G(p)$ with $0 < p < 1$. Then for any prior $\Pi$ on $\theta_2$ and
for any $\theta_1$, $K1^*(\theta_1, \Pi, G) = 0$.


PROOF. If $\theta_1 \le \mu$ then K1* = 0 by Theorem 4. Assume that $\theta_1 > \mu$. Let
$q = 1 - p$. We now derive a closed form expression for the utility function. We
have
$$U((K1, K2), \theta_1, \Pi, G) = U(K1, K2) = \sum_{i=1}^{\infty} q^{i-1} p \, U(K1, K2 \mid N = i).$$
By Theorem 5 we can, without loss, consider only those strategies for which all
K1 observations on treatment 1 precede any of the K2 observations on treatment
2. Thus,
$$U(K1, K2) = \sum_{i=1}^{K1} q^{i-1} p \, \theta_1 i + \sum_{i=K1+1}^{K1+K2} q^{i-1} p \, [\theta_1 K1 + \mu (i - K1)] + \sum_{i=K1+K2+1}^{\infty} q^{i-1} p \, [\theta_1 K1 + \mu K2 + H (i - K1 - K2)],$$
where $H = H(K2, \theta_1, \Pi) = \sum_{j=0}^{K2} \max\{\theta_1, E(\theta_2 \mid S = j)\} P(S = j)$. This reduces to
$$U((K1, K2), \theta_1, \Pi, G) = \theta_1 p^{-1} + q^{K1} \mu p^{-1} (1 - \theta_1 \mu^{-1}) + q^{K1+K2} \mu p^{-1} (H \mu^{-1} - 1). \tag{6}$$
Thus, U(K1, K2) is of the form
$$U(K1, K2) = q^{K1} f(K2) + C.$$
So
$$\max_{K1, K2} U(K1, K2) = \max_{K1} \max_{K2} \{ q^{K1} f(K2) + C \}.$$

If $\max_{K2} f(K2)$ is nonnegative then U(K1, K2) is maximized by setting
K1 = 0, which maximizes $q^{K1}$. If $\max_{K2} f(K2)$ is negative then U(K1, K2) is
maximized by setting $K1 = \infty$, which minimizes $q^{K1}$. But this yields utility C, as
does (K1, K2) = (0, 0). Thus K1* is always zero. $\Box$

Since K1* = 0, we can set K2 = K and simplify (6) to
$$U(K, \theta_1, \Pi, G) = \mu p^{-1} + q^{K} \mu p^{-1} (H \mu^{-1} - 1). \tag{7}$$
Expanding $q^K = (1 - p)^K$ and deleting terms involving $p^2, p^3, \ldots, p^K$ yields
$$U(K, \theta_1, \Pi, G) \approx \mu p^{-1} + \mu (p^{-1} - K)(H \mu^{-1} - 1) = \mu K + (p^{-1} - K) H. \tag{8}$$
In particular, if $\theta_2 \sim$ uniform(0, 1) then
$$U(K, \theta_1, G) \approx \tfrac{1}{2} K + (p^{-1} - K)(K + 1)^{-1} [M(K, \theta_1)],$$
where $M(K, \theta_1) = \sum_{j=0}^{K} \max\{\theta_1, E(\theta_2 \mid S = j)\}$.
When compared with the results from Section 2, we see that this is the same as
(2) with N replaced by $p^{-1}$, the mean of the geometric distribution. Thus
$$K^*(G(p)) = [(p^{-1} + 1)(\theta_1^{-1} - 1)]^{1/2} - 1.$$
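As a numerical sanity check, the closed form (6) can be compared with a direct (truncated) discounted sum using $\alpha_n = q^{n-1}$. This is a sketch only: it assumes a uniform(0, 1) prior, and the values theta1 = 0.6 and p = 0.05, the grid bounds, and the function names are illustrative choices rather than anything given in the paper.

```python
from math import sqrt


def H(K2, theta1):
    """Second-stage success rate for a uniform(0, 1) prior on theta2."""
    return sum(max(theta1, (j + 1) / (K2 + 2)) for j in range(K2 + 1)) / (K2 + 1)


def closed_form(K1, K2, theta1, p, mu=0.5):
    """Equation (6) with q = 1 - p."""
    q = 1.0 - p
    return (theta1 / p
            + q ** K1 * (mu - theta1) / p
            + q ** (K1 + K2) * (H(K2, theta1) - mu) / p)


def direct_sum(K1, K2, theta1, p, mu=0.5, terms=5000):
    """Truncated sum of alpha_n * (expected response of patient n)."""
    q, h, total = 1.0 - p, H(K2, theta1), 0.0
    for n in range(1, terms + 1):
        rate = theta1 if n <= K1 else mu if n <= K1 + K2 else h
        total += q ** (n - 1) * rate
    return total


theta1, p = 0.6, 0.05                          # assumed values; mean of N is 1/p = 20
grid = [(k1, k2) for k1 in range(41) for k2 in range(41)]
best = max(grid, key=lambda pr: closed_form(pr[0], pr[1], theta1, p))
k_approx = sqrt((1 / p + 1) * (1 / theta1 - 1)) - 1
print(best, round(k_approx, 2))
```

On this grid the maximizing pair has K1 = 0, as Theorem 6 asserts; the printed approximation can be compared with the maximizing K2.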
We now present a generalization of Theorem 6: If Q yields a discount sequence 
A that is regular then K1*(Q) = 0. The presence of this condition greatly 
simplifies the design problem. 
To prove this result we need several lemmas. This first lemma follows im- 
mediately from the definition of regularity and is presented without proof. 


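LEMMA 1. If the discount sequence A is regular then $\alpha_i \gamma_{m+1} \le \alpha_m \gamma_{i+1}$ for
$i \le m$.


LEMMA 2. Suppose that Q is regular and that $\mu < \theta_1$. Then

$$g_1(K2) = [H(K2) - \mu] \gamma_{K1+K2+1} + [\mu - \theta_1] \gamma_{K1+1} > 0 \tag{9}$$
implies
$$g(K2) = [H(K2) - \mu] \alpha_{K1+K2} + [\mu - \theta_1] \alpha_{K1} > 0. \tag{10}$$


PROOF. Inequality (9) implies $\gamma_{K1+K2+1} > 0$, which implies
$$\alpha_{K1+K2} > 0, \qquad \gamma_{K1+1} > 0$$
and, hence, $\alpha_{K1} > 0$. Thus
$$[H(K2) - \mu] + [\mu - \theta_1] \frac{\gamma_{K1+1}}{\gamma_{K1+K2+1}} > 0,$$
which implies
$$[H(K2) - \mu] \alpha_{K1+K2} + [\mu - \theta_1] \alpha_{K1} > 0,$$
since, by Lemma 1,
$$\frac{\gamma_{K1+1} \, \alpha_{K1+K2}}{\gamma_{K1+K2+1} \, \alpha_{K1}} \ge 1. \qquad \Box$$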



The next lemma makes use of the following fact. Suppose $\mu < \theta_1$. Let $Q_{K1}$ be
the conditional distribution of N given that N > K1. Let $K2' = K2'(Q_{K1}, K1)$
maximize U(K1, K2) over K2 with K1 fixed. That is, K2' is the smallest
optimal value of K2, for distribution Q, given K1. Then, in view of Theorem 5,
K2' is also the value of K2 that maximizes $U((0, K2), \theta_1, \Pi, Q_{K1})$. This holds
because the K1 observations on treatment 1 do not change $\Pi$, they only modify
Q to $Q_{K1}$. Thus

$$\max_{K2} U((0, K2), \theta_1, \Pi, Q_{K1}) = U((0, K2'), \theta_1, \Pi, Q_{K1}). \tag{11}$$

Lemma 3 shows that when Q is regular the maximal utility, U*(K1), for K1
fixed is no greater than the utility of using the ordered pair
$(K1 - 1, K2'(Q_{K1}, K1))$. This fact is crucial to the proof of Theorem 7 and will be used in
an obvious way.

LEMMA 3. Let Q be regular, let $\theta_1 > \mu$, and let K1 > 0 be fixed. Then

$$U^*(K1) = U((K1, K2'(Q_{K1}, K1)), \theta_1, \Pi, Q) \le U((K1 - 1, K2'(Q_{K1}, K1)), \theta_1, \Pi, Q).$$


PROOF.
$$U^*(K1) = U(K1, K2') = \theta_1 \sum_{i=1}^{K1} \alpha_i + \mu \sum_{i=K1+1}^{K1+K2'} \alpha_i + H(K2') \sum_{i=K1+K2'+1}^{\infty} \alpha_i.$$
Thus
$$U(K1 - 1, K2'(Q_{K1}, K1)) - U(K1, K2'(Q_{K1}, K1)) = [\mu - \theta_1] \alpha_{K1} + [H(K2') - \mu] \alpha_{K1+K2'} = g(K2'),$$
where g(K2) is defined by (10).

Define
$$U'(K2) = \sum_{i=K1+1}^{\infty} U(K2 \mid N = i) P(N = i \mid Q) = \mu \sum_{i=K1+1}^{K1+K2} \alpha_i + H(K2) \sum_{i=K1+K2+1}^{\infty} \alpha_i.$$
For convenience, consider
$$\begin{aligned}
U'(K2) - U'(0) &= [\mu - \theta_1] \sum_{i=K1+1}^{K1+K2} \alpha_i + [H(K2) - \theta_1] \sum_{i=K1+K2+1}^{\infty} \alpha_i \\
&= [H(K2) - \mu] \sum_{i=K1+K2+1}^{\infty} \alpha_i + [\mu - \theta_1] \sum_{i=K1+1}^{\infty} \alpha_i \\
&= g_1(K2) \text{ as defined by (9).}
\end{aligned}$$
Since, in view of (11), K2' maximizes U'(K2) and since $g_1(0) = 0$,
$K2'(Q_{K1}, K1) > 0$ implies $g_1(K2'(Q_{K1}, K1)) > 0$.
Now, if K2' > 0 then $g_1(K2') > 0$, which implies $g(K2') > 0$ by Lemma 2. If
K2' = 0 then $g(K2') = g(0) = 0$, since $H(0) = \theta_1$. Thus $g(K2') \ge 0$ for all K2'. $\Box$


We are now ready to prove the main two-stage result. Theorem 7 is similar to 
Theorem 2.1 of Berry and Fristedt (1979), who consider sequential allocation of 
treatments. 


THEOREM 7. If Q is regular then K1*(Q) = 0 for all $\theta_1$ and $\Pi$.


PROOF. If $\mu \ge \theta_1$ then K1* = 0 by Theorem 4. Suppose $\mu < \theta_1$. We have

$$U(K1, K2) = \theta_1 \sum_{i=1}^{K1} \alpha_i + \mu \sum_{i=K1+1}^{K1+K2} \alpha_i + H(K2) \sum_{i=K1+K2+1}^{\infty} \alpha_i.$$
$K2'(Q_{K1}, K1)$ maximizes
$$\mu \sum_{i=K1+1}^{K1+K2} \alpha_i + H(K2) \sum_{i=K1+K2+1}^{\infty} \alpha_i$$
in view of (11). That is, $U^*(K1) = U(K1, K2'(Q_{K1}, K1))$.
Suppose K1 > 0. Then

$$U(K1, K2'(Q_{K1}, K1)) \le U(K1 - 1, K2'(Q_{K1}, K1)) \tag{12}$$

by Lemma 3. But

$$U(K1 - 1, K2'(Q_{K1}, K1)) \le U(K1 - 1, K2'(Q_{K1-1}, K1 - 1)) = U^*(K1 - 1), \tag{13}$$

where the equality holds by (11).
Combining (12) and (13), we see that
$$U^*(K1) \le U^*(K1 - 1) \quad \text{for all } K1 > 0.$$
Since K1 must be at least zero, K1* = 0. $\Box$


There are settings in which K1 = 0 is not uniquely optimal. For some Q, $\theta_1$,
and $\Pi$ it is optimal to never use treatment 2. In such a case any value of K1 is
optimal.

The converse of the Berry and Fristedt result is true, while the converse of 
Theorem 7 is false. The following example shows that regularity is not a 
necessary condition for K1* to be zero. 


EXAMPLE 3. Suppose Q is of the form (0, a, 0, b, 0, 0, 0, ...) where a + b = 1
and $0 < b < \tfrac{1}{2}$. Since A = (1, 1, b, b, 0, 0, 0, ...), no distribution Q in this class is
regular, yet K1*(Q) = 0 for all $\theta_1$ and for all $\Pi$. This may be shown by process
of elimination, noting that $K1 + K2 \le 4$ since $N \le 4$ with probability 1.

We have now established that for two-stage trials with regular discount
sequences it is optimal to take no observations from the known treatment in the
first stage (K1* = 0). This result holds for any (fixed) number of stages.


THEOREM 8. Let a trial consist of D stages, where $D \ge 2$ is fixed before the
trial begins. If the discount sequence generated by Q is regular then K1*(Q) = 0
for all $\theta_1$ and $\Pi$.


PROOF. We establish this by considering two cases. For each case we use an
induction argument on the number of stages. We shall implicitly use the fact that
a regular discount sequence modified by the fact that N > c (a constant) is in
turn regular. We will use a subscript to keep track of the number of remaining
stages. For example, $K1^*_2$ is the smallest optimal number of first-stage
observations on treatment 1 when there are two stages remaining.


CASE 1. $\theta_1 \le \mu$. By Theorem 4, $K1^*_2 = 0$. Assume that $K1^*_D = 0$. We now
show that $K1^*_{D+1} = 0$. This requires the examination of three subcases. Note
that

$$U_{D+1}(K1, K2) = \sum_{i} U_{D+1}(K1, K2 \mid N = i) P(N = i \mid Q).$$

Consider $U_{D+1}(K1, K2 \mid N = i)$ and let L denote the number of observations on
treatment 1 in the first i observations.


SUBCASE 1.1. $i \le K2$. Here $0 \le L \le K1$ and $i - L \le K2$. Thus
$$U_{D+1}(K1, K2 \mid N = i) = L\theta_1 + (i - L)\mu \le i\mu = U_{D+1}(0, K2 \mid N = i).$$


SUBCASE 1.2. $K2 < i \le K1 + K2$. In this subcase
$$U_{D+1}(K1, K2 \mid N = i) = L\theta_1 + (i - L)\mu \le (i - K2)\theta_1 + K2(\mu) \le U_{D+1}(0, K2 \mid N = i).$$


SUBCASE 1.3. $K1 + K2 < i$. Here
$$U_{D+1}(K1, K2 \mid N = i) = K1(\theta_1) + K2(\mu) + E_{\Pi'} [U^*_D(\Pi', Q_{K1+K2}) \mid N = i],$$

where $\Pi'$ is the posterior distribution of $\theta_2$ after the first K1 + K2 observations
and $Q_{K1+K2}$ is the conditional distribution of N given that N > K1 + K2.
When K1 = 0 we have

$$U_{D+1}(0, K2 \mid N = i) = K2(\mu) + E_{\Pi'} [U^*_D(\Pi', Q_{K2}) \mid N = i].$$
The induction hypothesis implies that
$$E_{\Pi'} [U^*_D(\Pi', Q_{K2}) \mid N = i] \ge K1(\theta_1) + E_{\Pi'} [U^*_D(\Pi', Q_{K1+K2}) \mid N = i].$$
That is, with D stages remaining, it is optimal to take no observations on
treatment 1 in the first of the remaining stages. Hence
$$U_{D+1}(K1, K2 \mid N = i) \le U_{D+1}(0, K2 \mid N = i).$$
For all three subcases we have shown that
$$U_{D+1}(K1, K2 \mid N = i) \le U_{D+1}(0, K2 \mid N = i).$$

Thus $U_{D+1}(K1, K2) \le U_{D+1}(0, K2)$, which implies that $K1^*_{D+1} = 0$. This
completes the induction argument for Case 1.


CASE 2. $\theta_1 > \mu$. The proof of Theorem 5 generalizes easily. Thus, it is clear in
this case that, for any number of stages, it is optimal to take all first-stage
observations on treatment 1 before any first-stage observations on treatment 2.
We adopt this convention.

By Theorem 7, $K1^*_2 = 0$. Assume that $K1^*_D = 0$. Let $\tau$ denote the best
strategy among those that dictate that K1 first-stage observations be taken on
treatment 1. Then $U^*_{D+1}(K1) = u_{D+1}(\tau)$. We shall show that
$U^*_{D+1}(K1) \le U^*_{D+1}(K1 - 1)$. To do this, we shall employ the same ideas used in the two-stage
setting of Theorem 7. As in the two-stage setting, let $K2' = K2'(Q_{K1}, K1)$ be the
smallest optimal value of K2, for distribution Q, given K1. Then
$U^*_{D+1}(K1) = U_{D+1}(K1, K2')$.

Consider a strategy $\tau'$ that differs from $\tau$ in only one respect: Let $\tau'$ begin by
taking K1 - 1 first-stage observations on treatment 1. From observation K1 on,
let $\tau'$ assign to patient i that treatment that $\tau$ assigns to patient i + 1. Further,
let $\tau'$ dictate the same stage lengths as those of $\tau$ for all stages subsequent to the
first.

The induction hypothesis guarantees that all observations taken in stages
other than the first and the last will be on treatment 2. The total number of
observations on treatment 2 during stages one through D will, in general, be a
random variable depending on K2; call it J.

Let $R(X_J) = E[\max\{\theta_1, E(\theta_2 \mid X_J)\} \mid J]$. Generalizing the ideas in the proof of
Lemma 3, we have

$$u_{D+1}(\tau') - u_{D+1}(\tau) = [\mu - \theta_1] \alpha_{K1} + E\{[R(X_J) - \mu] \alpha_{K1+J}\} = h(K2'), \text{ say.}$$
We wish to show that $h(K2') \ge 0$ for all K2'. Now, K2' maximizes $U'_{D+1}(K2)$
where

$$U'_{D+1}(K2) = \sum_{i=K1+1}^{\infty} U_{D+1}(K2 \mid N = i) P(N = i).$$
Thus
$$\begin{aligned}
U'_{D+1}(K2') - U'_{D+1}(0) &= E \sum_{i=K1+1}^{K1+J} \alpha_i [\mu - \theta_1] + E \sum_{i=K1+J+1}^{\infty} \alpha_i [R(X_J) - \theta_1] \\
&= E\{[R(X_J) - \mu] \gamma_{K1+J+1}\} + [\mu - \theta_1] \gamma_{K1+1} \\
&= h_1(K2'), \text{ say.}
\end{aligned}$$
When necessary we have taken expectation with respect to the random variable J.

Now, $h_1(0) = 0$. Hence K2' > 0 implies $h_1(K2') > 0$. If K2' = 0, then
$h(K2') = h_1(K2') = 0$. If K2' > 0, then $h_1(K2') > 0$, which implies $h(K2') > 0$.
This follows from Lemma 2 with J in place of K2.

Thus $h(K2') \ge 0$ for all K2', which implies $u_{D+1}(\tau') \ge u_{D+1}(\tau)$. Clearly
$U^*_{D+1}(K1 - 1) \ge u_{D+1}(\tau')$. So we have

$$U^*_{D+1}(K1) = u_{D+1}(\tau) \le u_{D+1}(\tau') \le U^*_{D+1}(K1 - 1).$$

This holds for all K1 > 0. Thus $K1^*_{D+1} = 0$, which was to be shown. This
completes the induction argument for Case 2.
We have now shown that for either possible case $K1^*_{D+1} = 0$. $\Box$


COROLLARY. For any (fixed) number of stages, for any $\theta_1$, and for any $\Pi$, if
the discount sequence generated by Q is regular then it is optimal to use
treatment 1 in the last stage only.
OPTIMAL PROPERTIES OF THE BECHHOFER-KULKARNI
BERNOULLI SELECTION PROCEDURE


By RADHIKA V. KULKARNI AND CHRISTOPHER JENNISON 


SAS Institute, Inc. and University of Bath 


In a recent article Bechhofer and Kulkarni proposed a class of closed
adaptive sequential procedures for selecting that one of $k \ge 2$ Bernoulli
populations with the largest single-trial success probability. These sequential
procedures, which take no more than n observations from any one of the k
populations, achieve the same probability of a correct selection as does a
single-stage procedure which takes exactly n observations from every one of
the k populations. In addition, they often require substantially less than a
total of kn observations to terminate sampling. Amongst other problems,
Bechhofer and Kulkarni considered the problem of devising a procedure
within this class which minimizes the expected total number of observations
to terminate sampling. For their proposed procedure they cited several
optimality properties for the case k = 2 and conjectured additional optimality
properties for the case $k \ge 3$.

In this article we use a new method of proof to establish stronger results
than those cited by Bechhofer and Kulkarni for the case k = 2, and prove
stronger results than those conjectured for $k \ge 3$. We also describe a new
procedure for $k \ge 3$ and prove that it minimizes the expected total number of
observations to terminate sampling when all of the success probabilities are
small.


1. Introduction. Let $\Pi_i$ ($1 \le i \le k$) denote $k \ge 2$ Bernoulli populations
with corresponding single-trial "success" probabilities $p_i$. We denote the ordered
values of the $p_i$ by $p_{[1]} \le \cdots \le p_{[k]}$. Let $p = (p_1, \ldots, p_k)$ and
$\bar{p} = (p_{[1]}, \ldots, p_{[k]})$. The pairing of the $\Pi_i$ with the $p_{[j]}$ ($1 \le i, j \le k$) is assumed to
be completely unknown. Thus for given $\bar{p}$, p has probability 1/k! of being any
particular permutation of $\bar{p}$. The goal of the experimenter is to select as "best" a
population with success probability $p_{[k]}$; when such a population is selected we
say that a correct selection (CS) has been made.

Define $N_{[i]}$ to be the number of observations taken from the population
associated with $p_{[i]}$ at the termination of sampling and $N = \sum_{i=1}^{k} N_{[i]}$ to be the
total number of observations taken. For a given value of $\bar{p}$ we denote the
expected value of N using procedure $\mathcal{P}$ by $E_{\mathcal{P}}\{N \mid \bar{p}\}$.

We shall define a class $\mathcal{C}$ of procedures, all of which achieve the same P{CS},
uniformly in $\bar{p}$, as does the single-stage procedure which takes n observations
from each population and selects the population with the largest number of
successes, breaking ties at random. The procedures in $\mathcal{C}$ share a common
stopping rule and terminal decision rule but use different sampling rules; they
take no more than n observations from any one population and curtail sampling
as soon as is possible without decreasing the P{CS}.

We are concerned with finding procedures in $\mathcal{C}$ which minimize $E\{N \mid \bar{p}\}$
and/or minimize $E\{\sum_{i=1}^{k} \lambda_i N_{[i]} \mid \bar{p}\}$ where $\lambda_1 \ge \cdots \ge \lambda_k \ge 0$. (This latter goal
generalizes to $k \ge 3$ the goal of minimizing $E\{N_{[1]} \mid \bar{p}\}$, the expected number of
observations from the "inferior" population, when k = 2.) This approach
contrasts with that of comparing the performance of procedures which guarantee a
certain P{CS} requirement in, say, an indifference-zone formulation, in that by
restricting attention to procedures in $\mathcal{C}$ we compare procedures which guarantee
exactly the same P{CS} uniformly in $\bar{p}$.

If $\bar{p}$ is known or if a prior distribution for $\bar{p}$ is specified, an optimal procedure
within $\mathcal{C}$ for a particular goal exists and can be found by backwards induction.
However, calculation of such a procedure becomes prohibitively expensive as n or
k increases. We shall show that certain simply-described procedures are optimal
for particular regions of $\bar{p}$ space. This allows an experimenter to use the
procedure which is optimal for an initial estimate of $\bar{p}$, with the option of
switching to a different procedure as further information on $\bar{p}$ is obtained.

Bechhofer and Kulkarni (1982a) state several theorems regarding the optimality
of one procedure in $\mathcal{C}$ for k = 2. These theorems are proved in Kulkarni
(1981) and the performance characteristics of this procedure and of a generalized
procedure for $k \ge 3$ are described in Bechhofer and Kulkarni (1982b), Bechhofer
and Frisardi (1983), Percus and Percus (1984), Jennison (1984) and Kulkarni and
Kulkarni (1985). In the present paper we use a new method of proof to strengthen
results for the case k = 2 and obtain new results for $k \ge 3$. The same method is
used in Jennison and Kulkarni (1984) to derive optimal procedures for the
problem of selecting the s "best" populations, where $1 \le s \le k - 1$.

In Section 2 we define the class of procedures $\mathcal{C}$. Sections 3 and 4 are
concerned with the case k = 2. In Section 3 we define the procedure $\mathcal{P}^*$ and
prove that it minimizes $E\{N \mid \bar{p}\}$ among procedures in $\mathcal{C}$ whenever $p_{[1]} + p_{[2]} \ge 1$;
a conjugate procedure $\bar{\mathcal{P}}^*$ minimizes $E\{N \mid \bar{p}\}$ whenever $p_{[1]} + p_{[2]} \le 1$. In
Section 4 we show that $\mathcal{P}^*$ also minimizes $E\{N_{[1]} \mid \bar{p}\}$ in a specific region of the
parameter space and give conditions for $\mathcal{P}^*$ and $\bar{\mathcal{P}}^*$ to minimize the expected
total number of failures.

Section 5 extends the above results to the case $k \ge 3$. We prove that the
generalization of $\mathcal{P}^*$ proposed by Bechhofer and Kulkarni (1982a) minimizes
$E\{N \mid \bar{p}\}$ in a specific region of $\bar{p}$-space where all the $p_{[i]}$ ($1 \le i \le k$) are
sufficiently large. We define a generalization of $\bar{\mathcal{P}}^*$ and prove that it minimizes
$E\{N \mid \bar{p}\}$ in another specific region of $\bar{p}$-space where the $p_{[i]}$ ($1 \le i \le k$) are
sufficiently small. The goal of minimizing the expected number of observations
from an inferior population is generalized to the case $k \ge 3$ and we prove that
$\mathcal{P}^*$ is optimal in this respect over a specific region of $\bar{p}$-space; in particular, $\mathcal{P}^*$
minimizes $E\{\sum_{i=1}^{k} \lambda_i N_{[i]} \mid \bar{p}\}$ for any $\lambda_1 \ge \cdots \ge \lambda_k \ge 0$ whenever $p_{[1]} + p_{[2]} \ge 1$.


2. A class of closed sequential selection procedures. A selection procedure
is characterized by a sampling rule, a stopping rule, and a terminal decision
rule. We shall use the following notation in specifying such rules: when a total of
m observations have been taken the experiment is said to be at stage m, and the
state of the experiment at this point is the complete history of the first m
observations. This is denoted by $y_m = \{(i_1, v_1), (i_2, v_2), \ldots, (i_m, v_m)\}$, where $i_r$
is the index of the population from which the rth observation is taken, and
$v_r$ is the outcome of the rth observation (either a success or a failure); the
notation y without a subscript will be used to denote a general state of an
experiment. The following variables are functions of $y_m$, although the dependence
on m is suppressed in the notation:


$n_i$ = number of observations taken from $\Pi_i$ in the first m stages,
$s_i$ = number of "successes" yielded by $\Pi_i$ in the first m stages,
$f_i$ = number of "failures" yielded by $\Pi_i$ in the first m stages.


We restrict attention to procedures which take no more than n observations
from any one of the k populations where $n \ge 1$ is a prespecified integer; thus
$0 \le m \le kn$ and $n_i \le n$ ($1 \le i \le k$). Let the set of all possible states be

$$\Omega = \Big\{ y_m : 0 \le m \le kn; \; n_i \le n, \; 1 \le i \le k; \; \sum_{i=1}^{k} n_i = m \Big\}.$$

A sampling rule, $\mathcal{R}$, is a (possibly random) function from $\Omega$ to {1, 2, ..., k} which
specifies the index of the population from which the next observation is to be
taken.

The stopping rule $\mathcal{S}^*$ is: "Stop at the first stage m at which there exists at
least one population $\Pi_i$ such that

$$s_i \ge s_j + n - n_j \quad \text{for all } j \ne i.\text{"} \tag{2.1}$$

The terminal decision rule $\mathcal{T}^*$ is: "Select as best the population $\Pi_i$ which
satisfies (2.1). If more than one population satisfies (2.1) then select one of these
populations at random." Since we shall restrict attention to procedures using the
stopping rule $\mathcal{S}^*$, the notation $y_m$ will be used to refer only to states which could
arise under the stopping rule $\mathcal{S}^*$.

We shall consider the class $\mathcal{C}$ of sequential procedures which use the stopping
rule $\mathcal{S}^*$, terminal decision rule $\mathcal{T}^*$, and arbitrary sampling rule $\mathcal{R}$, which takes
no more than n observations from any one population; we write
$\mathcal{P} = (\mathcal{R}, \mathcal{S}^*, \mathcal{T}^*)$ to denote the procedure in $\mathcal{C}$ using the sampling rule $\mathcal{R}$.


THEOREM 2.1. If $\mathcal{P}_1$ and $\mathcal{P}_2$ are in $\mathcal{C}$ then $P\{CS \mid \bar{p}\}$ is equal for $\mathcal{P}_1$ and
$\mathcal{P}_2$ uniformly in $\bar{p}$.


This theorem is proved in Bechhofer and Kulkarni (1982a) and is a special case
of a more general theorem proved in Jennison (1983). A consequence of the
general theorem is that the P{CS} for the single-stage procedure of Sobel and
Huyett (1957) equals the P{CS} for any $\mathcal{P} \in \mathcal{C}$, uniformly in $\bar{p}$.


REMARK 2.1. The stopping rule $\mathcal{S}^*$ terminates the experiment as soon as
one or more populations which will have the most successes in their first n
observations can be identified. To stop sampling any earlier would decrease the
P{CS}. Hence, a procedure which is optimal within $\mathcal{C}$ for some objective
function is also optimal within the class of all procedures taking at most n
observations from any one population, that achieve the same P{CS} as the
single-stage procedure, uniformly in $\bar{p}$.


REMARK 2.2. For k = 2, if $\mathcal{R}$ is play-the-winner sampling the procedure
$(\mathcal{R}, \mathcal{S}^*, \mathcal{T}^*)$ is the same as that proposed by Hoel (1972); however, this
procedure is not optimal for our problem.


3. Minimizing the expected total number of observations: k = 2. We
seek a procedure which is optimal within the class $\mathcal{C}$, in the sense that it
minimizes $E\{N \mid (p_{[1]}, p_{[2]})\}$. We first consider the case in which $p_{[1]} \le p_{[2]}$ are
known but the pairing of $\Pi_1$, $\Pi_2$ with $p_{[1]}$, $p_{[2]}$ is unknown. Let $\omega \in \{1, 2\}$
denote the state of nature, where $\omega = 1$ if $(p_1, p_2) = (p_{[2]}, p_{[1]})$, i.e., $\Pi_1$ is
"best", and $\omega = 2$ if $(p_1, p_2) = (p_{[1]}, p_{[2]})$, i.e., $\Pi_2$ is "best"; the prior distribution
for $\omega$ is given by $P\{\omega = 1\} = P\{\omega = 2\} = \tfrac{1}{2}$.

For k = 2, the complete history, $y_m$, at stage m can be summarized by
$x(y_m) = (s_1, f_1, n_1; s_2, f_2, n_2)$. Here, $s_i + f_i = n_i$ (i = 1, 2) and the stopping rule
$\mathcal{S}^*$ given in (2.1) can be written as: "Stop as soon as

$$s_1 + f_2 = n \quad (\text{select } \Pi_1) \tag{3.1}$$
or
$$s_2 + f_1 = n \quad (\text{select } \Pi_2).\text{"} \tag{3.2}$$


For given $\bar{p} = (p_{[1]}, p_{[2]})$, Bayes optimal procedures which minimize $E\{N \mid \bar{p}\}$
can be found by backwards induction; Kulkarni (1981) used dynamic programming
to construct such procedures. It can be seen from the backwards induction
argument that there are Bayes optimal procedures with nonrandomized sampling
rules. In fact, there are Bayes optimal procedures whose sampling
rules depend on $x(y_m)$ only, but we shall make use of the more general form
of procedure in proving Theorem 3.1 below. For a nonrandomized procedure
$\mathcal{P} = (\mathcal{R}, \mathcal{S}^*, \mathcal{T}^*) \in \mathcal{C}$ we denote by $d(y_m; \mathcal{P}) \in \{1, 2\}$ the index of the population
to be sampled next when in state $y_m$, if the experiment has not yet stopped;
since a procedure in $\mathcal{C}$ is determined by its sampling rule we use the notation
$d(y; \mathcal{P})$ and $d(y; \mathcal{R})$ interchangeably.

The procedure $\mathcal{P}^* = (\mathcal{R}^*, \mathcal{S}^*, \mathcal{T}^*)$ defined in Kulkarni (1981) and Bechhofer
and Kulkarni (1982a) uses the following sampling rule which is also the Least
Failures Rule defined by Kelly (1981).


DEFINITION OF SAMPLING RULE $\mathcal{R}^*$. "Sample $\Pi_1$ next when $f_1 < f_2$ or
($f_1 = f_2$ and $s_1 > s_2$). Sample $\Pi_2$ next when $f_2 < f_1$ or ($f_1 = f_2$ and $s_2 > s_1$).
Sample $\Pi_1$ or $\Pi_2$ next with probability $\tfrac{1}{2}$ each if $s_1 = s_2$ and $f_1 = f_2$."
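
The stopping rule in the form (3.1)-(3.2) and the sampling rule just defined are simple enough to state as code. The following is an illustrative sketch for k = 2 only; the function names, the simulation driver, and the success probabilities used in any run are assumptions made for the example and are not part of the paper.

```python
import random


def stop_decision(s1, f1, s2, f2, n):
    """Stopping rule for k = 2 in the form (3.1)-(3.2).
    Returns the set of populations that may be selected (empty: keep sampling).
    At the first stage where a condition holds, it holds with equality."""
    winners = set()
    if s1 + f2 >= n:
        winners.add(1)
    if s2 + f1 >= n:
        winners.add(2)
    return winners


def least_failures_rule(s1, f1, s2, f2, rng=random):
    """Sampling rule: prefer fewer failures, then more successes, else toss a coin."""
    if f1 < f2 or (f1 == f2 and s1 > s2):
        return 1
    if f2 < f1 or (f1 == f2 and s2 > s1):
        return 2
    return rng.choice([1, 2])


def run_once(p1, p2, n, rng):
    """Simulate one experiment; returns (selected population, total observations)."""
    s, f = {1: 0, 2: 0}, {1: 0, 2: 0}
    while True:
        winners = stop_decision(s[1], f[1], s[2], f[2], n)
        if winners:
            break
        i = least_failures_rule(s[1], f[1], s[2], f[2], rng)
        if rng.random() < (p1 if i == 1 else p2):
            s[i] += 1
        else:
            f[i] += 1
    # Terminal decision: choose at random among populations satisfying the rule.
    return rng.choice(sorted(winners)), sum(s.values()) + sum(f.values())


print(run_once(0.7, 0.4, 5, random.Random(0)))
```

Because $(s_1 + f_2) + (s_2 + f_1)$ equals the total number of observations, sampling under this stopping rule ends after at least n and at most 2n - 1 observations.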

We now define a subclass $\mathcal{C}^*$ of the nonrandomized procedures in $\mathcal{C}$. Let
$\Omega_B = \{y : s_1 = s_2, f_1 = f_2\}$ and let $\Omega_A = \Omega \setminus \Omega_B$. Procedure $\mathcal{P}$ is in $\mathcal{C}^*$ if it uses
the stopping rule $\mathcal{S}^*$, the terminal decision rule $\mathcal{T}^*$ and a nonrandomized
sampling rule $\mathcal{R}$ for which $d(y; \mathcal{R}) = 1$ when $f_1 < f_2$ or ($f_1 = f_2$ and $s_1 > s_2$)
and $d(y; \mathcal{R}) = 2$ when $f_2 < f_1$ or ($f_1 = f_2$ and $s_2 > s_1$); there is no restriction on
$d(y; \mathcal{R})$ when $y \in \Omega_B$. The notation $d(y; *)$ will be used to denote the above
function for $y \in \Omega_A$. Note that $\mathcal{P}^*$ differs from the procedures in $\mathcal{C}^*$ only in
that it randomizes between $\Pi_1$ and $\Pi_2$ for $y \in \Omega_B$.


THEOREM 3.1. Amongst all procedures in 𝒞, those in 𝒞* are optimal, in the
sense that they minimize E{N|(p_[1], p_[2])} whenever p_[1] + p_[2] ≥ 1.


METHOD OF PROOF. For a particular pair p = (p_[1], p_[2]) with p_[1] + p_[2] ≥ 1
we take a nonrandomized Bayes optimal procedure 𝒫 ∈ 𝒞 and modify it by steps
to obtain a procedure in 𝒞*. We show that at each step, E{N|p} decreases or
remains equal (in fact, it must remain equal), and thus the resulting procedure is
optimal for p. It is easily seen that E{N|p} is the same for all procedures in 𝒞*
and hence, all procedures in 𝒞* are optimal for all p with p_[1] + p_[2] ≥ 1.


REMARK 3.1. It is easily seen that E_{𝒫*}{N|p} = E_𝒫{N|p} for all 𝒫 ∈ 𝒞*;
thus 𝒫* = (ℛ*, 𝒮*, 𝒯*) is also optimal among procedures in 𝒞, whenever
p_[1] + p_[2] ≥ 1.


In order to prove Theorem 3.1 and for the proofs of later results, we require 
the following lemmas: 


LEMMA 3.1. For a given nonrandomized procedure 𝒫 ∈ 𝒞 and a given state
ŷ_m̂, suppose that after being in state ŷ_m̂ it is impossible to stop, using 𝒫, without
eventually taking an observation from Π_i. Then there is a nonrandomized
procedure 𝒫' ∈ 𝒞 such that


d(y_m; 𝒫') = d(y_m; 𝒫)    for m < m̂,
d(y_m̂; 𝒫') = d(y_m̂; 𝒫)    for y_m̂ ≠ ŷ_m̂,
d(ŷ_m̂; 𝒫') = i
and E_{𝒫'}{N_(1)|p} ≤ E_𝒫{N_(1)|p} and E_{𝒫'}{N_(2)|p} ≤ E_𝒫{N_(2)|p} for all p.


PROOF. (i) If d(ŷ_m̂; 𝒫) = i, take 𝒫' = 𝒫.

(ii) If d(ŷ_m̂; 𝒫) ≠ i, define 𝒫' as follows: “Take the first m̂ observations
using the sampling rule of 𝒫, and if y_m̂ ≠ ŷ_m̂ continue to use 𝒫. If y_m̂ = ŷ_m̂ take
an observation from Π_i and then continue to use the sampling rule of 𝒫, ignoring
the observation on Π_i until 𝒫 calls for an observation from Π_i; now use the
sampling rule of 𝒫, behaving as if the earlier observation from Π_i had been taken
at this point. At all times use the stopping rule 𝒮*, based on all of the
observations that have been taken up to that stage.”

For a fixed pair of sequences of observations on Π_1 and Π_2, 𝒫 must take at
least as many observations as 𝒫' on each of Π_1 and Π_2 and the result follows. □

LEMMA 3.2. For a given nonrandomized procedure 𝒫 ∈ 𝒞 and a given state
ŷ_m̂ ∈ Ω_B (i.e., ŝ_1 = ŝ_2 and f̂_1 = f̂_2) there are nonrandomized procedures 𝒫_1 and
𝒫_2 ∈ 𝒞 such that


d(y_m; 𝒫_1) = d(y_m; 𝒫_2) = d(y_m; 𝒫)    for m < m̂,
d(y_m̂; 𝒫_1) = d(y_m̂; 𝒫_2) = d(y_m̂; 𝒫)    for y_m̂ ≠ ŷ_m̂,
d(ŷ_m̂; 𝒫_1) = 1,  d(ŷ_m̂; 𝒫_2) = 2


and E_{𝒫_1}{N_(1)|p} = E_{𝒫_2}{N_(1)|p} = E_𝒫{N_(1)|p} and E_{𝒫_1}{N_(2)|p} = E_{𝒫_2}{N_(2)|p} =
E_𝒫{N_(2)|p} for all p.


PROOF. We construct a procedure 𝒫_1 with the required properties. A procedure
𝒫_2 can be constructed similarly. Let i^c denote the complement of i in
{1, 2}.

(i) If d(ŷ_m̂; 𝒫) = 1, take 𝒫_1 = 𝒫.

(ii) If d(ŷ_m̂; 𝒫) = 2, define 𝒫_1 as follows: “Take the first m̂ observations
using the sampling rule of 𝒫, and if y_m̂ ≠ ŷ_m̂ continue to use 𝒫. If y_m̂ = ŷ_m̂ take
d(y_m; 𝒫_1) = {d(y'_m; 𝒫)}^c for m ≥ m̂, where y'_m is formed from y_m by changing i_r
to i_r^c for all r ≥ m̂ + 1.” Thus, when y_m̂ = ŷ_m̂, 𝒫_1 behaves like 𝒫 would behave
if the labels of the two populations had been interchanged after stage m̂. By
symmetry, E_{𝒫_1}{N_(1)|p} = E_𝒫{N_(1)|p} and E_{𝒫_1}{N_(2)|p} = E_𝒫{N_(2)|p} for all p. □


Let Ω_0 = {y ∈ Ω_A: s_1 + f_2 = n − 1 or s_2 + f_1 = n − 1}. Thus Ω_0 is the set of
states for which either s_1 ≠ s_2 or f_1 ≠ f_2, and the next observation will be the last
one if it takes the appropriate value.


LEMMA 3.3. For a given nonrandomized procedure 𝒫 ∈ 𝒞, define 𝒫' by
d(y; 𝒫') = d(y; 𝒫)     for y ∉ Ω_0,
d(y; 𝒫') = d(y; 𝒞*)    for y ∈ Ω_0.
Then E_{𝒫'}{N|p} ≤ E_𝒫{N|p} whenever p_[1] + p_[2] ≥ 1.


PROOF. Without loss of generality we consider a state ŷ ∈ Ω_0 with x(ŷ) =
(ŝ_1, f̂_1, n̂_1; ŝ_2, f̂_2, n̂_2) and for which ŝ_1 + f̂_2 = n − 1. If on any future observation
a success is obtained from Π_1 or a failure from Π_2, then the experiment
terminates. Thus from this point, the application of a particular sampling rule is
determined by the fixed sequence, containing (n − n̂_1) 1's and (n − n̂_2) 2's, which
gives the order of the population indices for those of the remaining (2n − n̂_1 − n̂_2)
possible observations which need to be taken. Note that under the stopping rule
𝒮* the 2n-th observation is never taken, i.e., N ≤ 2n − 1.

Suppose there are m_1 1's and m_2 2's in the first r terms of such a sequence
(m_1 + m_2 = r, n̂_1 + n̂_2 + r ≤ 2n − 2). Then for ω = 1, (p_1, p_2) = (p_[2], p_[1])
and, conditional on the occurrence of state ŷ,


P{N > n̂_1 + n̂_2 + r} = (1 − p_[2])^{m_1} (p_[1])^{m_2}.

This expression decreases as m_1 increases if p_[1] + p_[2] ≥ 1. Similarly, for ω = 2,

P{N > n̂_1 + n̂_2 + r} = (1 − p_[1])^{m_1} (p_[2])^{m_2},


which also decreases as m_1 increases if p_[1] + p_[2] ≥ 1. If the experiment has not
stopped by stage 2n − 2, it does not matter which population is sampled next
since the next observation will be the last. Thus, if p_[1] + p_[2] ≥ 1, an optimal
sampling rule from state ŷ on is to take observations from Π_1 when possible; 𝒫'
therefore samples optimally from state ŷ on and the result follows. □
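
The monotonicity argument in this proof can be checked numerically by brute force. The Python sketch below is only an illustration under the simplified model used in the proof (after a draw from Π_1 the experiment continues only on a failure, after a draw from Π_2 only on a success, and the final term of the sequence is never needed). The names `expected_further` and `best_orderings`, and the arguments `p_pop1` and `p_pop2` (the success probabilities of Π_1 and Π_2), are hypothetical.

```python
from itertools import permutations

def expected_further(seq, p_pop1, p_pop2):
    """Expected number of further observations when sampling in the order seq.

    Sums the survival probabilities P{N > n1 + n2 + r} over r = 0, ..., L - 2,
    where L = len(seq); the last term of seq is never taken.
    """
    total, surviving = 0.0, 1.0
    for idx in seq[:-1]:
        total += surviving
        # continue after Pi_1 only on a failure, after Pi_2 only on a success
        surviving *= (1.0 - p_pop1) if idx == 1 else p_pop2
    return total

def best_orderings(a, b, p_pop1, p_pop2):
    """Evaluate every ordering of a 1's and b 2's; return (minimum, all values)."""
    seqs = set(permutations((1,) * a + (2,) * b))
    values = {z: expected_further(z, p_pop1, p_pop2) for z in seqs}
    return min(values.values()), values
```

With a = 3 and b = 2, the ordering (1, 1, 1, 2, 2) attains the minimum for both labelings of (0.45, 0.7), whose sum exceeds 1, but not for (0.3, 0.2), whose sum is below 1.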


LEMMA 3.4. For a given nonrandomized procedure 𝒫 ∈ 𝒞 such that
d(y; 𝒫) = d(y; 𝒞*) for all y ∈ Ω_0, suppose there is a state ŷ_m̂ ∈ Ω_A with
x(ŷ_m̂) = (ŝ_1, f̂_1, n̂_1; ŝ_2, f̂_2, n̂_2) and d(ŷ_m̂; 𝒫) ≠ d(ŷ_m̂; 𝒞*). Then there is a
nonrandomized procedure 𝒫' ∈ 𝒞 such that


d(y; 𝒫') = d(y; 𝒞*)         for all y ∈ Ω_0,
d(y_m; 𝒫') = d(y_m; 𝒫)      for m < m̂,
d(y_m̂; 𝒫') = d(y_m̂; 𝒫)      for y_m̂ ≠ ŷ_m̂,
d(ŷ_m̂; 𝒫') = d(ŷ_m̂; 𝒞*),
and E_{𝒫'}{N|p} ≤ E_𝒫{N|p} when p_[1] + p_[2] ≥ 1.


PROOF. First, note that the procedures 𝒫' constructed in Lemma 3.1 can be
modified by application of Lemma 3.3 to give procedures 𝒫'' for which
d(y; 𝒫'') = d(y; 𝒞*) for all y ∈ Ω_0, with no increase in E{N|p}. Also, the
procedures 𝒫_1 and 𝒫_2 constructed in Lemma 3.2 preserve the property d(y; 𝒫) =
d(y; 𝒞*) for all y ∈ Ω_0. To prove the lemma we consider the following cases:


CASE 1a: ŝ_1 > ŝ_2 and f̂_1 > f̂_2.

In this case d(ŷ_m̂; 𝒞*) = 2. Neither (3.1) nor (3.2) can be satisfied without a
further observation from Π_2 and the required procedure 𝒫' can be constructed
by application of Lemmas 3.1 and 3.3.


CASE 2a: ŝ_1 ≤ ŝ_2 and f̂_1 > f̂_2.

In this case d(ŷ_m̂; 𝒞*) = 2. Since d(y; 𝒫) = d(y; 𝒞*) for y ∈ Ω_0, it is
impossible to terminate after being in state ŷ_m̂ without an observation from Π_2.
The required procedure 𝒫' can therefore be constructed by application of
Lemmas 3.1 and 3.3.


CASE 3a: ŝ_1 > ŝ_2 and f̂_1 = f̂_2.

In this case d(ŷ_m̂; 𝒞*) = 1. First define 𝒫_1 as in Lemma 3.2, so that if the
next (ŝ_1 − ŝ_2) observations are taken from Π_2 and they are all successes, then
the next observation is taken from Π_1. Since d(y; 𝒫_1) = d(y; 𝒞*) for y ∈ Ω_0, it
is impossible to terminate under 𝒫_1, after being in state ŷ_m̂, without an
observation from Π_1 and the required procedure 𝒫' can be constructed from 𝒫_1 by
application of Lemmas 3.1 and 3.3.

The proofs for Case 1b: ŝ_2 > ŝ_1 and f̂_2 > f̂_1, Case 2b: ŝ_2 ≤ ŝ_1 and f̂_2 > f̂_1 and
Case 3b: ŝ_2 > ŝ_1 and f̂_2 = f̂_1 are as above with the indices 1 and 2 interchanged.
These six cases cover all ŷ_m̂ ∈ Ω_A and the lemma is proved. □


PROOF OF THEOREM 3.1. The proof uses Lemmas 3.3 and 3.4. Suppose 𝒫 ∈ 𝒞
is a nonrandomized Bayes optimal procedure for a given p = (p_[1], p_[2]) with
p_[1] + p_[2] ≥ 1. Then by Lemma 3.3, 𝒫 can be modified to give a procedure 𝒫'
for which d(y; 𝒫') = d(y; 𝒞*) for all y ∈ Ω_0 and, in turn, using Lemma 3.4, 𝒫'
can be modified gradually, starting with m = 1 etc., to obtain a procedure
𝒫'' ∈ 𝒞* for which E_{𝒫''}{N|p} ≤ E_𝒫{N|p}. It follows that all procedures in 𝒞*
are Bayes optimal for this problem and since the procedures do not depend on p
they are optimal whenever p_[1] + p_[2] ≥ 1. □


The conjugate procedure 𝒫̄* = (ℛ̄*, 𝒮*, 𝒯*) uses the following sampling
rule.


DEFINITION OF SAMPLING RULE ℛ̄*. “Sample Π_1 next when s_1 < s_2 or
(s_1 = s_2 and f_1 > f_2). Sample Π_2 next when s_1 > s_2 or (s_1 = s_2 and f_1 < f_2).
Sample Π_1 or Π_2 next with probability 1/2 each if s_1 = s_2 and f_1 = f_2.”


It is a consequence of Theorem 3.1 that 𝒫̄* and all procedures 𝒫 which agree
with 𝒫̄* when s_1 ≠ s_2 or f_1 ≠ f_2 minimize E{N|p} whenever p_[1] + p_[2] ≤ 1.
For k = 2, we denote by 𝒞̄* the class of nonrandomized procedures 𝒫 ∈ 𝒞
which agree with 𝒫̄* when s_1 ≠ s_2 or f_1 ≠ f_2.


REMARK 3.2. The sampling rules for procedures in 𝒞* and 𝒞̄* do not
depend on n.


4. Minimizing the expected number of observations from the inferior 
population and the expected total number of failures: k = 2. 


THEOREM 4.1. Amongst all procedures in 𝒞, those in 𝒞* minimize
E{N_(1)|(p_[1], p_[2])}, the expected number of observations from the inferior
population, if and only if

(4.1)    p_[2] ≥ max{ (1 + p_[2]/p_[1])^{-1}, (1 + (1 − p_[1])/(1 − p_[2]))^{-1} }.


This condition reduces to p_[2] ≥ [3 − p_[1] − {(3 − p_[1])^2 − 4}^{1/2}]/2; a sufficient
condition is p_[2] ≥ 1/2.
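
The reduction of (4.1) to a single closed-form bound can be checked numerically. The Python sketch below assumes the ordering p_[1] ≤ p_[2] and evaluates both forms of the condition; the function names `condition_41` and `closed_form_bound` are hypothetical.

```python
import math

def condition_41(p_lo, p_hi):
    """Condition (4.1) with p_lo = p_[1] and p_hi = p_[2], 0 < p_lo <= p_hi < 1."""
    bound = max(1.0 / (1.0 + p_hi / p_lo),
                1.0 / (1.0 + (1.0 - p_lo) / (1.0 - p_hi)))
    return p_hi >= bound

def closed_form_bound(p_lo):
    """Smaller root of x^2 - (3 - p_lo) x + 1 = 0, the reduced lower bound on p_[2]."""
    return (3.0 - p_lo - math.sqrt((3.0 - p_lo) ** 2 - 4.0)) / 2.0
```

The bound comes from the second term of the maximum: p_[2] ≥ (1 − p_[2])/(2 − p_[1] − p_[2]) is equivalent to p_[2]^2 − (3 − p_[1]) p_[2] + 1 ≤ 0, whose larger root is at least 1. For instance, `closed_form_bound(0.5)` equals 0.5.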


METHOD OF PROOF. The proof is similar to that of Theorem 3.1. The
following lemma corresponds to Lemma 3.3.

LEMMA 4.1. For a given nonrandomized procedure 𝒫 ∈ 𝒞, define 𝒫' by
d(y; 𝒫') = d(y; 𝒫)     for y ∉ Ω_0,
d(y; 𝒫') = d(y; 𝒞*)    for y ∈ Ω_0.
Then E_{𝒫'}{N_(1)|p} ≤ E_𝒫{N_(1)|p} if (4.1) is true.


PROOF. Without loss of generality we consider a state ŷ ∈ Ω_0 with x(ŷ) =
(ŝ_1, f̂_1, n̂_1; ŝ_2, f̂_2, n̂_2) and for which ŝ_1 + f̂_2 = n − 1 and n̂_i < n (i = 1, 2); it
follows that ŝ_1 ≥ ŝ_2 and f̂_1 ≤ f̂_2 with at least one strict inequality. As before, a
success from Π_1 or a failure from Π_2 leads to termination of the experiment, and
the application of a particular sampling rule from this point is determined by the
fixed sequence, containing (n − n̂_1) 1's and (n − n̂_2) 2's, which gives the order of
the population indices for the remaining observations. Denote this sequence by
z = (i_1, i_2, i_3, ...). Under 𝒫', the sequence consists of (n − n̂_1) 1's followed by
(n − n̂_2) 2's. The sequence z corresponding to 𝒫 can be transformed into that
corresponding to 𝒫' by successively interchanging pairs of elements in z. This
may be done in such a way that each transition is of the form z_2 to z_1, where the
sequences z_1 and z_2 consist of


(i) a sequence containing a_1 1's and a_2 2's, for some a_1 and a_2, which is the
same for both z_1 and z_2, followed by
(ii) (1, 2) in z_1 and (2, 1) in z_2, and then
(iii) the same final terms in both z_1 and z_2.


Let E_z{N^ŷ_(1)|p} denote the expected number of further observations on the
inferior population, starting from state ŷ with sampling according to the
sequence z. We shall show that for z_1 and z_2 as described above, E_{z_1}{N^ŷ_(1)|p} ≤
E_{z_2}{N^ŷ_(1)|p} if (4.1) holds; hence E_{𝒫'}{N^ŷ_(1)|p} ≤ E_𝒫{N^ŷ_(1)|p} and the lemma follows.

If the difference between the sequences z_1 and z_2 is in the last two elements,
then the only difference in sampling occurs at a state y ∈ Ω_B and hence
E_{z_1}{N^ŷ_(1)|p} = E_{z_2}{N^ŷ_(1)|p}. Suppose now that the difference in the two sequences
occurs earlier. Note that this implies a_2 ≤ ŝ_1 − ŝ_2 and a_1 ≤ f̂_2 − f̂_1 with at least
one strict inequality. Under ω = 1, Π_2 is the inferior population and by considering
the pairs of sequences of observations from Π_1 and Π_2 which lead to
different values for N^ŷ_(1) under z_1 and z_2, we have


E_{z_2}{N^ŷ_(1)|p, ω = 1} − E_{z_1}{N^ŷ_(1)|p, ω = 1} = (1 − p_[2])^{a_1} (p_[1])^{a_2} p_[2].

Similarly,

E_{z_2}{N^ŷ_(1)|p, ω = 2} − E_{z_1}{N^ŷ_(1)|p, ω = 2} = −(1 − p_[1])^{a_1} (p_[2])^{a_2} (1 − p_[2]).

The posterior distribution of ω in state ŷ is given by P{ω = 1} = α_1,
P{ω = 2} = α_2, where α_1 + α_2 = 1 and


α_1/α_2 = (p_[2]/p_[1])^{ŝ_1 − ŝ_2} {(1 − p_[1])/(1 − p_[2])}^{f̂_2 − f̂_1}.

Hence

E_{z_2}{N^ŷ_(1)|p} − E_{z_1}{N^ŷ_(1)|p} = α_1 (1 − p_[2])^{a_1} (p_[1])^{a_2} p_[2]
                                        − α_2 (1 − p_[1])^{a_1} (p_[2])^{a_2} (1 − p_[2])

and this is positive if


(4.2)    p_[2]^{-1} ≤ 1 + α_1 (1 − p_[2])^{a_1} (p_[1])^{a_2} / {α_2 (1 − p_[1])^{a_1} (p_[2])^{a_2}}
                    = 1 + (p_[2]/p_[1])^{ŝ_1 − ŝ_2 − a_2} {(1 − p_[1])/(1 − p_[2])}^{f̂_2 − f̂_1 − a_1}.


But ŝ_1 − ŝ_2 − a_2 ≥ 0 and f̂_2 − f̂_1 − a_1 ≥ 0 with at least one strict inequality;
thus (4.2) holds for all the required ŷ if it holds both when ŝ_1 − ŝ_2 − a_2 = 1 and
f̂_2 − f̂_1 − a_1 = 0 and when ŝ_1 − ŝ_2 − a_2 = 0 and f̂_2 − f̂_1 − a_1 = 1. Substituting
these two cases into (4.2) gives condition (4.1). □


PROOF OF THEOREM 4.1. The proof that procedures in 𝒞* minimize
E{N_(1)|p} if (4.1) holds uses Lemmas 3.1, 3.2, and 4.1 in the same way that
Lemmas 3.1 to 3.3 were used to prove Theorem 3.1.

The necessity of condition (4.1) is proved by showing that if it does not hold,
procedures in 𝒞* do not minimize E{N^y_(1)|p} for certain states y. The superior
sampling rules are found by interchanging a pair of elements in the vector z
discussed in Lemma 4.1. □


Let N^F denote the total number of failures from both populations at
termination, so that E{N^F|p} = E{Σ_{i=1}^{2} (1 − p_[i]) N_(i) | p}.


THEOREM 4.2. Amongst all procedures in 𝒞, those in 𝒞* minimize
E{N^F|(p_[1], p_[2])} if and only if either p_[1] + p_[2] ≥ 1 or p_[1] + p_[2] < 1 and


(4.3)    p_[2] ≥ [2 − 4p_[1] + p_[1]^2 − (1 − p_[1])(2 − 4p_[1] + p_[1]^2)^{1/2}] / (1 − 2p_[1]).


PROOF. The proof is similar to that of Theorem 4.1. With state ŷ and
sequences z_1 and z_2 as described in Lemma 4.1, let E_z{N^{F,ŷ}|p} denote the expected
number of further failures, starting from ŷ, sampling according to the sequence z.
Then


(4.4)    E_{z_2}{N^{F,ŷ}|p} − E_{z_1}{N^{F,ŷ}|p}
             = α_1 (1 − p_[2])^{a_1} (p_[1])^{a_2} (1 − p_[1]) (2p_[2] − 1)
             + α_2 (1 − p_[1])^{a_1} (p_[2])^{a_2} (1 − p_[2]) (2p_[1] − 1).


This expression is positive if p_[1] + p_[2] ≥ 1. Consideration of the cases
{ŝ_1 − ŝ_2 − a_2 = 1 and f̂_2 − f̂_1 − a_1 = 0} and {ŝ_1 − ŝ_2 − a_2 = 0 and f̂_2 − f̂_1 − a_1 = 1}
gives the necessary and sufficient condition for (4.4) always to be positive when
p_[1] + p_[2] < 1, which reduces to (4.3). □


THEOREM 4.3. Amongst all procedures in 𝒞, those in 𝒞̄* minimize
E{N^F|(p_[1], p_[2])} if p_[2] ≤ 1/2.


PROOF. The proof is similar to that of Theorem 4.2. Since procedures in 𝒞̄*
are associated with the sequence z_2, they minimize E{N^F|p} for values of p_[1]
and p_[2] for which (4.4) is always negative. A sufficient condition for this is
p_[2] ≤ 1/2, but the necessary condition, obtained by substituting extreme values of
a_1 and a_2 in (4.4), depends on n. □


5. Extensions to three or more populations. In this section we extend the
results of Sections 3 and 4 to the case k ≥ 3. We continue to restrict attention to
the class 𝒞 of procedures using the stopping rule 𝒮* defined by (2.1), the
terminal decision rule 𝒯*, and a sampling rule which takes at most n observations
from any one of the k populations. We first generalize the classes 𝒞* and
𝒞̄* for k ≥ 3.

Let O_1 be the ordering in which Π_i precedes Π_j if f_i < f_j or if f_i = f_j and
s_i > s_j, and Π_i is tied with Π_j if both f_i = f_j and s_i = s_j. Let A_1 denote the set
of populations tied in first place under O_1. 𝒞* consists of those nonrandomized
procedures in 𝒞 which always take the next observation from a member of A_1;
the procedure 𝒫* = (ℛ*, 𝒮*, 𝒯*) of Bechhofer and Kulkarni (1982a) chooses
one of these populations at random.

The ordering O_2 is defined as follows: first place is given to the population
with most successes, with ties broken according to the smallest number of
failures; if there is still a tie, then all of these tied populations are tied for first
place. Let r be the number of populations tied for first place; then the remaining
(k − r) populations occupy places (r + 1) to k and are ordered as in O_1. If r = 1,
let A_2 denote the population in second place under O_2, or the set of such
populations if there is a tie; if r > 1 and the (r + 1)st population under O_2 has at
least as many failures as the populations tied for first place, let A_2 denote the set
of populations tied for first place under O_2; if r > 1 and the (r + 1)st population
under O_2 has fewer failures (and therefore fewer successes) than the populations tied
for first place, let A_2 denote the population in (r + 1)st place under O_2, or the set
of such populations if there is a tie. 𝒞̄* consists of those nonrandomized
procedures in 𝒞 which always take the next observation from a member of A_2. In
this context the natural generalization of ℛ̄* is the sampling rule which chooses
one of these populations at random. This should be distinguished from the
generalization ℛ̄* given by Bechhofer and Kulkarni (1982a). Although, at first
sight, 𝒞* and 𝒞̄* appear to have quite different forms, they are in fact members
of a single family of procedures; see Jennison and Kulkarni (1984).
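
The two candidate sets described above can be sketched in Python as follows. This is an illustrative reading of the definitions rather than code from the paper: a state is assumed to be a list of (successes, failures) pairs, populations are indexed from 0, eliminated populations are not treated specially, and the function names are hypothetical. The case in which every population is tied for first place is not covered by the text; returning the whole tied set there is an assumption.

```python
def first_group(indices, key, state):
    """Indices whose key is smallest, i.e. those tied for first place."""
    best = min(key(state[i]) for i in indices)
    return [i for i in indices if key(state[i]) == best]

def fewest_failures_key(sf):
    """Fewest failures first, ties broken by most successes."""
    s, f = sf
    return (f, -s)

def most_successes_key(sf):
    """Most successes first, ties broken by fewest failures."""
    s, f = sf
    return (-s, f)

def least_failures_candidates(state):
    """Populations tied for first place under the fewest-failures ordering."""
    return first_group(range(len(state)), fewest_failures_key, state)

def conjugate_candidates(state):
    """Candidate set for the conjugate class, following the three cases in the text."""
    idx = list(range(len(state)))
    leaders = first_group(idx, most_successes_key, state)
    rest = [i for i in idx if i not in leaders]
    if not rest:                       # assumption: all populations tied
        return leaders
    next_group = first_group(rest, fewest_failures_key, state)
    if len(leaders) == 1:              # r = 1: sample the runner-up
        return next_group
    if state[next_group[0]][1] >= state[leaders[0]][1]:
        return leaders                 # next place has at least as many failures
    return next_group                  # next place has fewer failures
```

For k = 2 the sketch reproduces the two sampling rules of Section 3: with state [(3, 1), (2, 2)], `least_failures_candidates` returns [0] and `conjugate_candidates` returns [1].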

We note that both 𝒞* and 𝒞̄*, as generalized above, agree with the previous
definitions for k = 2. We shall prove that procedures in 𝒞* and 𝒞̄* minimize
E{N|p} in two different regions of the parameter space {p = (p_[1], ..., p_[k])}.

When k ≥ 3 the concept of elimination is useful. Populations are eliminated
successively: if f_i + s_j ≥ n and Π_j is not yet eliminated, then we say Π_j
eliminates Π_i; once a population has been eliminated it is unnecessary to take
any further observations from it since at best it can only tie with its eliminator
for the most successes.


THEOREM 5.1. Amongst all procedures in 𝒞, those in 𝒞* minimize
E{N|(p_[1], ..., p_[k])} if p_[1] + (Σ_{i=2}^{k} p_[i])/(k − 1) ≥ 1.


REMARK 5.1. We note that this is a stronger result than the conjecture of
Bechhofer and Kulkarni (1982a, Conjecture 7.1), namely that 𝒫* minimizes
E{N|p} for p_[1] + p_[2] ≥ 1.


METHOD OF PROOF. The proof is similar to that of Theorem 3.1. The major 
difference occurs in the following generalization of Lemma 3.3. 


LEMMA 5.1. Suppose that all but two populations have been eliminated, with
populations Π_i and Π_j remaining; suppose also that s_i + f_j = n − 1, s_i + f_i < n,
s_j + f_j < n, and either s_i > s_j or f_i < f_j. Then if p_[1] + (Σ_{h=2}^{k} p_[h])/(k − 1) ≥ 1,
an optimal sampling rule from this point is to sample from population Π_i until
the first success, switching to Π_j only if n − (s_i + f_i) failures are obtained from
Π_i.

PROOF. Using the notation of Lemma 4.1, suppose that in the above situation
a procedure 𝒫 which does not sample from eliminated populations gives rise
to a sequence z of i's and j's which contains the pair (j, i) as two consecutive
elements and that these are not the last two elements in the sequence. Let ŷ
denote the state of the experiment when the procedure 𝒫 is about to take the
observation on Π_j corresponding to the “j” of the pair (j, i). Define the
procedure 𝒫' which agrees with 𝒫 except that the pair (j, i) is replaced by
(i, j). For fixed p_i and p_j, if N^ŷ denotes the total number of further observations
starting from ŷ, then E_{𝒫'}{N^ŷ|p} − E_𝒫{N^ŷ|p} = 1 − p_i − p_j. Let α_{i_1,i_2} be the
posterior probability, when in state ŷ, that the two populations Π_i and Π_j
which have not been eliminated have success probabilities p_[i_1] and p_[i_2],
respectively. Then


(5.1)    E_{𝒫'}{N^ŷ|p} − E_𝒫{N^ŷ|p} = Σ_{i_1} Σ_{i_2 ≠ i_1} α_{i_1,i_2} (1 − p_[i_1] − p_[i_2]).


For given p, the prior distribution for p assigns probability 1/k! to each
permutation of p. In state ŷ, Π_i has at least as many successes and at most as
many failures as each other population, with at least one strict inequality in each
case; thus, α_{i_1,i_2} is an increasing function of i_1 for fixed i_2, and the right-hand
side of (5.1) is negative if p_[1] + (Σ_{h=2}^{k} p_[h])/(k − 1) ≥ 1. It is easily seen that
there is no advantage in sampling from eliminated populations and the result
follows. □

PROOF OF THEOREM 5.1. Let 𝒫 be a Bayes optimal nonrandomized procedure
for a particular p with p_[1] + (Σ_{i=2}^{k} p_[i])/(k − 1) ≥ 1. Then 𝒫 can be
modified to agree with 𝒞* in the situations described in Lemma 5.1, with no
increase in E{N|p}; it can then be seen that in any state a future observation
from a population in the set A_1 is essentially inevitable (the possibility of a new
population tying with those in A_1 is handled using a generalization of Lemma
3.2) and hence 𝒫 can be modified to give a procedure in 𝒞*, again with no
increase in E{N|p}. It follows that all procedures in 𝒞* minimize E{N|p} for
the specified p. □


THEOREM 5.2. Amongst all procedures in 𝒞, those in 𝒞̄* minimize
E{N|(p_[1], ..., p_[k])} if


(5.2)    p_[k] {1 + Σ_{i=1}^{k−2} Π_{m=1}^{i} (1 − p_[m])} ≤ Π_{m=1}^{k−1} (1 − p_[m]);


values of p satisfying this condition include those for which either
Σ_{i=1}^{k−1} p_[i] + (k − 1) p_[k] ≤ 1 or p_[k] ≤ 1 − (1/2)^{1/(k−1)}.
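
Condition (5.2) and its two sufficient conditions are easy to evaluate numerically. The Python sketch below assumes the success probabilities are supplied in ascending order, so that the last entry is p_[k]; the function names are hypothetical.

```python
import math

def condition_52(p):
    """Condition (5.2); p is sorted ascending, p[0] = p_[1], ..., p[-1] = p_[k]."""
    k = len(p)
    bracket, partial = 1.0, 1.0
    for i in range(k - 2):            # accumulates prod_{m=1}^{i+1} (1 - p_[m])
        partial *= 1.0 - p[i]
        bracket += partial
    rhs = math.prod(1.0 - q for q in p[:-1])
    return p[-1] * bracket <= rhs

def sufficient_sum(p):
    """First sufficient condition: sum_{i<k} p_[i] + (k - 1) p_[k] <= 1."""
    return sum(p[:-1]) + (len(p) - 1) * p[-1] <= 1.0

def sufficient_max(p):
    """Second sufficient condition: p_[k] <= 1 - (1/2)^(1/(k-1))."""
    return p[-1] <= 1.0 - 0.5 ** (1.0 / (len(p) - 1))
```

For k = 2 the bracket in (5.2) equals 1 and the condition becomes p_[1] + p_[2] ≤ 1, in agreement with Section 3. With four equal probabilities the condition holds at 0.2 and fails at 0.21, on either side of the bound 1 − (1/2)^{1/3} ≈ 0.206.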


METHOD OF PROOF. Again, the proof is similar to that of Theorem 3.1. The 
following lemma replaces Lemma 3.3. 


LEMMA 5.2. Suppose that in state ŷ there is a unique leader in the ordering
O_2; without loss of generality let this be population Π_1. Suppose also that
s_1 + f_i ≥ n − 1 for i ≠ 1 so that a single success from Π_1 terminates the
experiment with the selection of Π_1. Then if (5.2) holds, there is an optimal
procedure for minimizing E{N|p} which does not take the next observation from
Π_1.


PROOF. First consider a state ŷ for which no populations have been eliminated
and for which f_1 = f_j for all j ≠ 1. In this case f_j = n − s_1 − 1, (n − 1)
observations have been taken from Π_1 and n − (r_j + 1) observations, say, have been
taken from Π_j (j ≠ 1) where r_j ≥ 1. Without loss of generality suppose 1 ≤ r_2 ≤
r_3 ≤ ··· ≤ r_k. Suppose further that 𝒫 is a Bayes optimal procedure for a given
p and that in state ŷ, 𝒫 takes the next observation from Π_1. If this observation
is a success the experiment stops; if not it can be seen by the “inevitable
observation” argument that 𝒫 must sample from Π_2, ..., Π_k in order, in each
case sampling from Π_j until it is eliminated by the occurrence of a failure or until
another (r_j + 1) successes are obtained from Π_j and the experiment stops with
Π_j selected as best. We define the procedure 𝒫' as follows: sample as under 𝒫
but omit the initial observation from Π_1; if r_j successes are obtained from Π_j,
take an observation from Π_1 and then proceed again as under 𝒫, behaving as if
the observation from Π_1 had been taken when in state ŷ. Let N^ŷ denote the total
number of further observations starting from ŷ, and let X = E_𝒫{N^ŷ|p} −
E_{𝒫'}{N^ŷ|p}. For fixed p = (p_1, ..., p_k) it can be shown that

(5.3)    X = (1 − p_2^{r_2})(1 − p_3^{r_3}) ··· (1 − p_k^{r_k})
             − p_1 { (1 − p_2^{r_2})/(1 − p_2)
                     + (1 − p_2^{r_2})(1 − p_3^{r_3})/(1 − p_3)
                     + ···
                     + (1 − p_2^{r_2}) ··· (1 − p_{k−1}^{r_{k−1}})(1 − p_k^{r_k})/(1 − p_k) }.

As a function of r_j, with r_l fixed (l ≠ j), X is of the form

         A(1 − p_j^{r_j}) − B − C(1 − p_j^{r_j})/(1 − p_j),

where B ≥ 0. If X ≥ 0 for r_j = 1, then A(1 − p_j) − C ≥ B, hence

         {A(1 − p_j) − C}(1 − p_j^{r_j})/(1 − p_j) ≥ B

and X ≥ 0 for all r_j ≥ 1. Thus X ≥ 0 for all 1 ≤ r_2 ≤ r_3 ≤ ··· ≤ r_k if


(5.4)    Π_{j=2}^{k} (1 − p_j) − p_1 {1 + Σ_{i=2}^{k−1} Π_{m=2}^{i} (1 − p_m)} ≥ 0.


For given p = (p_[1], ..., p_[k]) the left-hand side of (5.4) is smallest when p_1 =
p_[k], p_2 = p_[1], p_3 = p_[2], ..., p_k = p_[k−1] and hence (5.4) is certainly satisfied if
(5.2) holds; this establishes a contradiction and we conclude that an optimal
policy does not sample from Π_1 when in state ŷ.

We now consider states ŷ in which no population has been eliminated and for
which f_1 < f_j (j ≠ 1) and the values f_j are equal for all j ≠ 1. We argue
inductively on the value of f_j − f_1. Again suppose there is a Bayes optimal
procedure 𝒫 which samples from Π_1 when in state ŷ and use the inductive
hypothesis to find the optimal sampling procedure when a single failure on
population Π_1 has been observed after being in state ŷ. A procedure 𝒫' can then
be constructed as before for which X = E_𝒫{N^ŷ|p} − E_{𝒫'}{N^ŷ|p} is given by (5.3),
where now r_j = n − n_j ≥ 1, and the result follows.

Finally, we consider states ŷ in which one or more populations have been
eliminated. The proof follows the same lines using the fact that for given p the
expression for E_𝒫{N^ŷ|p} − E_{𝒫'}{N^ŷ|p} is positive as long as X defined by (5.3)
with r_2 = r_3 = ··· = r_k = 1 is positive. □


PROOF OF THEOREM 5.2. Let 𝒫 be a Bayes optimal nonrandomized procedure
for a known p which satisfies (5.2). In a state where there is a unique leader
under O_2, it is a consequence of Lemma 5.2 that 𝒫 can be modified so that an
observation on a population in A_2 is essentially inevitable and hence 𝒫 can be
modified to give a procedure which agrees with 𝒞̄* in such situations, with no
increase in E{N|p}. Similarly, if populations Π_i and Π_j, say, are tied for first
place under O_2 and f_i = f_j ≤ f_l (1 ≤ l ≤ k) then an observation on one of the
leading populations is essentially inevitable and 𝒫 can be further modified to
agree with 𝒞̄* in these situations also, with no increase in E{N|p}. Finally, if
r > 1 populations, including Π_i and Π_j, are tied for first place under O_2 and
f_i = f_j > f_u for some u, then an observation on a population in (r + 1)th place
under O_2 is essentially inevitable and 𝒫 can be modified once more, again with
no increase in E{N|p}, to agree with 𝒞̄* in these situations also. The resulting
procedure is in 𝒞̄* and therefore the procedures in 𝒞̄* minimize E{N|p} for the
specified p.

If Σ_{i=1}^{k−1} p_[i] + (k − 1) p_[k] ≤ 1, then Π_{m=i}^{k−1} (1 − p_[m]) ≥ (k − 1) p_[k] for
1 ≤ i ≤ k − 1, and hence


{Π_{m=1}^{k−1} (1 − p_[m])}^{-1} p_[k] {1 + Σ_{i=1}^{k−2} Π_{m=1}^{i} (1 − p_[m])}
    = p_[k] Σ_{i=1}^{k−1} {Π_{m=i}^{k−1} (1 − p_[m])}^{-1} ≤ 1,

and thus (5.2) holds. By considering the case p_[1] = p_[2] = ··· = p_[k] it is seen
that (5.2) holds whenever p_[k] ≤ 1 − (1/2)^{1/(k−1)}.


We note that the methods of proof in Theorems 5.1 and 5.2 show that
procedures in 𝒞* and 𝒞̄* sample optimally for the specified values of p, starting
from any state y regardless of whether it can be reached using a procedure in 𝒞*
or 𝒞̄*, respectively. □

The procedures in 𝒞* and 𝒞̄* behave in two quite different ways. Those in
𝒞* are appropriate when the p_[i]'s are large; they aim to reach a conclusion by
obtaining a large number of successes from the leading population. Procedures in
𝒞̄* are appropriate when the p_[i]'s are small; they aim to reach a conclusion by
obtaining a large number of failures from every losing population. The sets of
values of p for which procedures in 𝒞* and 𝒞̄* minimize E{N|p} are larger
than those given in the statements of Theorems 5.1 and 5.2; further specification
of these two sets can be found from the details of the proofs of the theorems. We
showed in Section 3 that for k = 2 the regions of optimality of 𝒞* and 𝒞̄*
together span the entire parameter space; this is not the case for k ≥ 3, and it
does not seem possible to characterize optimal procedures in this remaining
region using the techniques of Sections 3 and 4. The disadvantage of procedures
in 𝒞̄* is that a failure on each of (k − 1) losing populations is needed to do the
work of a single success on the leading population, and this is a serious problem
for large k. The problem is less serious if several populations have been eliminated,
and this suggests that a combination of 𝒞* and 𝒞̄* may be optimal when
neither is optimal by itself. In practice p usually is not known and we suggest
that either ℛ* or ℛ̄* be used as a sampling rule, according to the current
estimated success probabilities of the uneliminated populations.

An extension of Theorem 4.1 requires a generalization of the number of
observations from the inferior population to the case k ≥ 3. In a medical study,
for instance, one wishes to allocate patients to treatments with high success
probabilities but at the same time an early result is desirable and the total
number of observations taken should also be small. A general objective might be
to minimize E{Σ_{i=1}^{k} λ_i N_(i) | p} where λ_1 ≥ ··· ≥ λ_k ≥ 0 and λ = (λ_1, ..., λ_k) is
possibly a function of p. Special cases of this goal are the minimization of
E{N_(1)|p}, E{Σ_{i=1}^{k−1} N_(i) | p} and E{N|p}; if λ_i = 1 − p_[i], E{Σ_{i=1}^{k} λ_i N_(i) | p} is the
expected total number of failures, discussed in Section 4 for k = 2. We say that a
procedure is fully optimal for a particular p if it minimizes E{Σ_{i=1}^{k} λ_i N_(i) | p} for
all λ satisfying λ_1 ≥ ··· ≥ λ_k ≥ 0. An equivalent requirement is that the
procedure minimize E{Σ_{i=1}^{r} N_(i) | p} for all r (1 ≤ r ≤ k). The following theorem
shows that 𝒫* = (ℛ*, 𝒮*, 𝒯*) is fully optimal if p_[1] + p_[2] ≥ 1.


THEOREM 5.3. Amongst all procedures in 𝒞, those in 𝒞* minimize
E{Σ_{i=1}^{r} N_(i) | (p_[1], ..., p_[k])} for all r (1 ≤ r ≤ k) whenever p_[1] + p_[2] ≥ 1.


METHOD OF PROOF. Again, the proof is similar to that of Theorem 3.1. This 
time the following lemma replaces Lemma 3.3. 


LEMMA 5.3. Suppose ŷ is a state in which all except two populations have
been eliminated and these have success probabilities p_[i_1] and p_[i_2]; we condition
on knowing p_[i_1] and p_[i_2] but not their pairing with the two populations. Let N^ŷ_(i)
denote the number of further observations taken on the population associated
with p_[i], after being in state ŷ.


(i) If r < min{i_1, i_2} then E{Σ_{i=1}^{r} N^ŷ_(i) | p_[i_1], p_[i_2]} = 0 for all procedures in
𝒞.

(ii) If i_1 ≤ r < i_2 then a sufficient condition for procedures in 𝒞* to minimize
E{Σ_{i=1}^{r} N^ŷ_(i) | p_[i_1], p_[i_2]} is p_[i_2] ≥ 1/2.

(iii) If max{i_1, i_2} ≤ r then a sufficient condition for procedures in 𝒞* to
minimize E{Σ_{i=1}^{r} N^ŷ_(i) | p_[i_1], p_[i_2]} is p_[i_1] + p_[i_2] ≥ 1.


PROOF. The proof follows directly from the 2-population results of Theorems
3.1 and 4.1. □


PROOF OF THEOREM 5.3. It follows from Lemma 5.3 that p_[1] + p_[2] ≥ 1 is a
sufficient condition for procedures in 𝒞* to minimize E{Σ_{i=1}^{r} N^ŷ_(i) | p} for
all states ŷ in which all but two populations have been eliminated and
for all 1 ≤ r ≤ k. The results of Lemma 5.3 can be applied as in the proofs of
previous theorems to show that procedures in 𝒞* are fully optimal whenever
p_[1] + p_[2] ≥ 1. □

ASYMPTOTICALLY EFFICIENT SELECTION OF THE ORDER BY
THE CRITERION AUTOREGRESSIVE TRANSFER FUNCTION 


By R. J. BHANSALI 
University of Liverpool 


The autoregressive orders selected by the criterion autoregressive transfer
function (CAT) of Parzen (1974), a new version, CAT*, of CAT introduced
by Parzen (1977), and the CAT_α criterion of Bhansali (1985) are shown to be
asymptotically efficient in the sense defined by Shibata (1980, 1981). A
generalization of the penalty function considered by Shibata (1980) is
introduced. The order selected by the CAT_α criterion of Bhansali (1985), with any
fixed α > 1, is asymptotically efficient with respect to this generalized penalty
function.


1. Introduction. In an important paper, Shibata (1980) derived an asymp- 
totic lower bound for the mean squared error of prediction of an infinite-order 
Gaussian autoregressive process when the order of the fitted autoregression is 
selected from data. He then used this bound to show that the order selection by 
minimizing the final prediction error criterion (FPE) of Akaike (1970) and the 
information criterion, AIC, of Akaike (1973) is asymptotically efficient in the 
sense that for either of these criteria the lower bound for the mean squared error 
of prediction is attained asymptotically. Shibata (1981) extended his (1980) 
results by obtaining an asymptotic lower bound for the integrated relative 
squared error of an autoregressive spectral estimate when the fitted order is 
determined from data and demonstrating that if the fitted order is selected by 
minimizing FPE, or AIC, then again the lower bound is attained asymptotically. 

As is well-known, an alternative method for autoregressive order selection 
involves the use of the criterion autoregressive transfer function (CAT) of Parzen 
(1974). Indeed, this criterion was introduced by Parzen for implementing pre- 
cisely the same “nonparametric” autoregressive model fitting approach to time 
series modelling as considered by Shibata (1980, 1981). In this approach, the 
behavior of an observed time series of length T, say, is modelled by an autoregressive
process of order k. However, k is interpreted not as an estimate of the order
of a finite autoregressive process, but as providing an optimal finite-order ap- 
proximation to a truly infinite-order process; in theoretical arguments, k is 
treated as a function of T and assumed to approach infinity simultaneously 
with T. 

Parzen (1977) has introduced a new version, CAT*, say, of CAT by slightly
modifying the definition of the penalty function used for introducing CAT. Also,
Bhansali (1985) has suggested an extension of the latter penalty function and, by
examining the question of bias, he has introduced a new criterion, the CAT_α
criterion, in which α > 1 is an arbitrary constant. The choice α = 2 corresponds
to adopting the same penalty function as used for introducing CAT. However,
the functional form of the corresponding CAT_2 criterion is not the same as that of
CAT.

In this paper, we show that the optimality property derived by Shibata (1980,
1981) for FPE and AIC also holds for CAT, CAT* and CAT$_2$. Our result thus
establishes an asymptotic equivalence between these criteria when the generating 
process is an infinite-order autoregression. Note that for a finite-order autoregres- 
sive process, the asymptotic equivalence of these criteria is established by 
Bhansali (1985), who shows that the asymptotic distributions of their selected 
orders are the same. Empirical support for these asymptotic results is provided 
by Parzen (1977), Beamish and Priestley (1981), and Bhansali (1985), who report 
that these criteria frequently select the same orders. 

A related reference is Taniguchi (1980), who has suggested that the optimality 
property of AIC derived by Shibata for autoregressive model fitting may be 
extended to the nonparametric fitting of autoregressive-moving-average models; 
however, the proofs given there are not rigorous and appear to be incomplete in 
their present form. 


2. Definition of optimality of a selected order. Consider a discrete-time
second-order stationary process $\{x_t\}$ with mean 0, covariance function $R(u) = E(x_t x_{t+u})$,
and satisfying the following assumption:


ASSUMPTION 1. The process $x_t$ is an infinite-order autoregressive process

$$\sum_{u=0}^{\infty} a(u)x_{t-u} = \varepsilon_t, \qquad a(0) = 1,$$

where $\varepsilon_t$ is a sequence of independent normal variates, each with mean zero and
variance $\sigma^2$, the $\{a(u)\}$ are absolutely summable real coefficients, i.e., $\sum_u |a(u)| < \infty$,
such that the polynomial

$$A(z) = \sum_{u=0}^{\infty} a(u)z^u$$

is nonzero for $|z| \le 1$. Also, $x_t$ does not degenerate to a finite-order autoregressive
process.


Having observed $x_1, \ldots, x_T$, suppose that the order k is selected from the
range $1 \le k \le K_T$, where $K_T$ satisfies the following assumption:


ASSUMPTION 2. $\{K_T\}$ $(T = 1, 2, \ldots)$ is a sequence of positive integers such
that $K_T \to \infty$, $K_T^2/T \to 0$ as $T \to \infty$.


In Section 3, we also require the following assumption.


ASSUMPTION 3.

$$\sum_{u=-\infty}^{\infty} |u|\,|R(u)| < \infty.$$


The kth-order least-squares estimator $\hat a(k) = [\hat a_k(1), \ldots, \hat a_k(k)]'$ of the
autoregressive coefficients is a solution of the equation
(2.1) $$\hat R(k)\hat a(k) = -\hat r(k),$$
where $\hat R(k) = [D^{(T)}(u, v)]$ $(u, v = 1, \ldots, k)$, $\hat r(k) = [D^{(T)}(0, 1), \ldots, D^{(T)}(0, k)]'$,

$$D^{(T)}(u, v) = N^{-1}\sum_{t=K_T+1}^{T} x_{t-u}x_{t-v},$$

and $N = T - K_T$. The corresponding theoretical parameter
$a(k) = [a_k(1), \ldots, a_k(k)]'$ will also be needed and is defined by

$$R(k)a(k) = -r(k),$$

where $R(k) = [R(u - v)]$ $(u, v = 1, \ldots, k)$ and $r(k) = [R(1), \ldots, R(k)]'$. Note
that the $a_k(j)$ are the coefficients of the kth-order linear least-squares predictor
of $x_t$ given $x_{t-1}, \ldots, x_{t-k}$ and

$$\sigma^2(k) = \sum_{j=0}^{k} a_k(j)R(j), \qquad a_k(0) = 1,$$

is the corresponding mean squared error of prediction. An estimate of $\sigma^2(k)$ is
given by

$$\hat\sigma^2(k) = \sum_{j=0}^{k} \hat a_k(j)D^{(T)}(0, j), \qquad \hat a_k(0) = 1.$$

Put

$$e_{t,k} = \sum_{j=0}^{k} a_k(j)x_{t-j}.$$

We may write
(2.2) $$\hat a(k) - a(k) = -\hat R(k)^{-1}\sum_{t=K_T+1}^{T} X_t(k)e_{t,k}/N,$$
where $X_t(k) = [x_{t-1}, \ldots, x_{t-k}]'$.
For an arbitrary infinite-dimensional vector $\delta = [\delta_1, \delta_2, \ldots]'$, let

$$\|\delta\|_R = \Bigl\{\sum_{u=1}^{\infty}\sum_{v=1}^{\infty} \delta_u\delta_v R(u - v)\Bigr\}^{1/2}$$

denote its norm with respect to the matrix $R = [R(u - v)]$ $(u, v = 1, 2, \ldots)$.
Also, let $a = [a(1), a(2), \ldots]'$ denote an infinite-dimensional vector. In the sequel,
it will often be convenient to think of $a(k)$ and $\hat a(k)$ also as infinite-dimensional
vectors with $\hat a_k(j) = a_k(j) = 0$ $(j > k)$.

Parzen (1974) suggested selecting k by minimising the criterion
(2.3) $$\mathrm{CAT}(k) = 1 - \{\hat\sigma^2/\tilde\sigma^2(k)\} + (k/T),$$
where

$$\tilde\sigma^2(k) = T(T - k)^{-1}\hat\sigma^2(k),$$

$$\hat\sigma^2 = 2\pi\exp\Bigl\{n^{-1}\sum_{j=1}^{n}\log I^{(T)}(\omega_j) + \gamma\Bigr\},$$

$$I^{(T)}(\lambda) = (2\pi T)^{-1}\Bigl|\sum_{t=1}^{T} x_t\exp(-it\lambda)\Bigr|^2,$$

$\gamma = 0.57721$, $\omega_j = 2\pi j/T$, and n is the largest integer not greater than $(T - 1)/2$,
i.e., $n = [(T - 1)/2]$.
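The periodogram-based estimate of the innovation variance used in (2.3) can be sketched numerically as follows. This is only an illustrative sketch: it assumes the estimate has the form $2\pi\exp\{n^{-1}\sum_j \log I^{(T)}(\omega_j) + \gamma\}$, and the function names `periodogram` and `innovation_variance` are hypothetical, not taken from the paper.

```python
import cmath
import math
import random

EULER_GAMMA = 0.57721  # the value of gamma quoted in the text


def periodogram(x, lam):
    """I(lam) = (2*pi*T)^{-1} |sum_{t=1}^{T} x_t exp(-i*t*lam)|^2."""
    T = len(x)
    s = sum(xt * cmath.exp(-1j * (t + 1) * lam) for t, xt in enumerate(x))
    return abs(s) ** 2 / (2.0 * math.pi * T)


def innovation_variance(x):
    """Assumed form: 2*pi*exp{mean of log I at the Fourier frequencies + gamma}."""
    T = len(x)
    n = (T - 1) // 2  # largest integer not greater than (T - 1)/2
    mean_log = sum(
        math.log(periodogram(x, 2.0 * math.pi * j / T)) for j in range(1, n + 1)
    ) / n
    return 2.0 * math.pi * math.exp(mean_log + EULER_GAMMA)


# Simulated Gaussian white noise with unit variance (illustrative data only).
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(256)]
sigma2_hat = innovation_variance(noise)
```

For white noise with unit variance the value of `sigma2_hat` should lie near 1, which is a quick sanity check on the normalisation of the periodogram.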
As discussed in Section 1, Parzen (1977) later suggested that k may also be
selected by minimising the criterion
(2.4) $$\mathrm{CAT}^*(k) = T^{-1}\sum_{j=1}^{k}\tilde\sigma^{-2}(j) - \tilde\sigma^{-2}(k).$$
The CAT$_\alpha$ criterion considered by Bhansali (1985) is of the form
(2.5) $$\mathrm{CAT}_\alpha(k) = 1 - \{\hat\sigma^2/\hat\sigma^2(k)\} + \alpha(k/T),$$
where $\alpha > 1$ is an arbitrary constant, and k is selected by minimising this
criterion.

We note that the CAT* criterion has the advantage that its definition does not
depend upon $\hat\sigma^2$, which is the case with the CAT and CAT$_\alpha$ criteria. Also, AIC
has been defined without explicitly requiring the evaluation of $\hat\sigma^2$.
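The three criteria can be evaluated side by side once the residual variances are available. The sketch below assumes that (2.3) uses the bias-adjusted variance $T(T-k)^{-1}\hat\sigma^2(k)$ while (2.5) uses $\hat\sigma^2(k)$ itself; the input values and all function names are hypothetical and serve only as an illustration.

```python
def sigma_tilde2(sig2_hat, k, T):
    """Bias-adjusted residual variance T/(T - k) * sigma_hat^2(k); sig2_hat[k-1] holds order k."""
    return T * sig2_hat[k - 1] / (T - k)


def cat(sig2_hat, sig2_inf, T):
    """CAT(k) of (2.3) for k = 1..K."""
    return [1.0 - sig2_inf / sigma_tilde2(sig2_hat, k, T) + k / T
            for k in range(1, len(sig2_hat) + 1)]


def cat_alpha(sig2_hat, sig2_inf, T, alpha):
    """CAT_alpha(k) of (2.5) for k = 1..K (assumed to use sigma_hat^2(k))."""
    return [1.0 - sig2_inf / sig2_hat[k - 1] + alpha * k / T
            for k in range(1, len(sig2_hat) + 1)]


def cat_star(sig2_hat, T):
    """CAT*(k) of (2.4) for k = 1..K; it does not need the periodogram estimate."""
    out, running = [], 0.0
    for k in range(1, len(sig2_hat) + 1):
        inv = 1.0 / sigma_tilde2(sig2_hat, k, T)
        running += inv
        out.append(running / T - inv)
    return out


def selected_order(values):
    """Order (1-based) at which a criterion attains its minimum."""
    return 1 + min(range(len(values)), key=values.__getitem__)


# Hypothetical residual variances for orders 1..5 and a hypothetical sigma_hat^2.
T = 100
sig2_hat = [1.50, 1.20, 1.10, 1.05, 1.04]
sig2_inf = 1.00
orders = (selected_order(cat(sig2_hat, sig2_inf, T)),
          selected_order(cat_alpha(sig2_hat, sig2_inf, T, 2.0)),
          selected_order(cat_star(sig2_hat, T)))
```

Under these assumptions CAT and CAT$_2$ differ exactly by $(k/T)\{\hat\sigma^2/\hat\sigma^2(k) - 1\}$, which is why the two share a penalty function without sharing a functional form.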

When Assumptions 1 and 3 hold, $\hat\sigma^2$ converges in probability to $\sigma^2$ as $T \to \infty$;
see Bhansali (1985). As shown by Hannan and Nicholls (1977), Assumption 3 is
not strictly necessary for this result to hold and may be replaced by an
assumption requiring that the spectral density function of $x_t$ satisfy a Lipschitz
condition. However, Assumption 3 is made for ease of exposition and it is used
only for ensuring that $\hat\sigma^2$ is consistent for $\sigma^2$. Also, this assumption has not been
made for establishing the asymptotic efficiency of the CAT* criterion.

Shibata (1980) argues that if the objective of fitting the autoregressive model
is prediction then the goodness of the fitted model may be evaluated by mean
squared error of prediction as defined by the following penalty function:

$$Q_T(k) = \|\hat a(k) - a\|_R^2 = \|a(k) - a\|_R^2 + \|\hat a(k) - a(k)\|_R^2.$$


Shibata shows that if $\{k_T\}$ is a sequence of integers such that $1 \le k_T \le K_T$
and $k_T \to \infty$ as $T \to \infty$ then
(2.6) $$\operatorname*{plim}_{T\to\infty}\{Q_T(k_T)/L_T(k_T)\} = 1,$$
where

$$L_T(k) = \sigma^2(k) - \sigma^2 + k\sigma^2/N.$$


Now, let $\{k_T^*\}$ $(T = 1, 2, \ldots)$ be a sequence of positive integers at each of which
the minimum of $L_T(k)$ with respect to k is attained, i.e.,

$$L_T(k_T^*) = \min_{1\le k\le K_T} L_T(k) \qquad (T = 1, 2, \ldots).$$

Then $k_T^* \to \infty$ as $T \to \infty$.
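To see how the minimising order grows with the sample size, the following sketch evaluates $L_T(k) = \sigma^2(k) - \sigma^2 + k\sigma^2/N$ for a first-order moving average process, which is an infinite-order autoregression. The choice of process, the parameter values, and the function names are assumptions made purely for illustration; the prediction variances $\sigma^2(k)$ are obtained with the standard Levinson-Durbin recursion.

```python
def prediction_variances(R, K):
    """Levinson-Durbin recursion: sigma^2(k), k = 0..K, from autocovariances R[0..K]."""
    phi = []          # coefficients of the current best linear predictor
    v = [R[0]]        # sigma^2(0) = R(0)
    for k in range(1, K + 1):
        acc = R[k] - sum(phi[j] * R[k - 1 - j] for j in range(k - 1))
        kappa = acc / v[-1]  # partial autocorrelation at lag k
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        v.append(v[-1] * (1.0 - kappa * kappa))
    return v


def L_T(v, sigma2, N):
    """L_T(k) = sigma^2(k) - sigma^2 + k*sigma^2/N for k = 1..K."""
    return [v[k] - sigma2 + k * sigma2 / N for k in range(1, len(v))]


def k_star(v, sigma2, N):
    """Order at which L_T(k) attains its minimum over 1 <= k <= K."""
    values = L_T(v, sigma2, N)
    return 1 + min(range(len(values)), key=values.__getitem__)


# MA(1) process x_t = e_t + 0.5 e_{t-1} with unit innovation variance (illustrative).
K = 20
theta, sigma2 = 0.5, 1.0
R = [sigma2 * (1.0 + theta ** 2), sigma2 * theta] + [0.0] * (K - 1)
v = prediction_variances(R, K)
k_small, k_large = k_star(v, sigma2, 100), k_star(v, sigma2, 10000)
```

In this example the minimising order increases when N is increased from 100 to 10000, in line with $k_T^* \to \infty$.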


A remarkable result established by Shibata is that for any random variable $\tilde k$,
possibly depending on $x_1, \ldots, x_T$, and for any $\delta > 0$,

$$\lim_{T\to\infty}\operatorname{pr}\{Q_T(\tilde k)/L_T(k_T^*) \ge 1 - \delta\} = 1.$$

Therefore, a selected order $\tilde k$ is defined to be asymptotically efficient if
(2.7) $$\operatorname*{plim}_{T\to\infty}\{Q_T(\tilde k)/L_T(k_T^*)\} = 1.$$
This definition is also adopted in this paper.


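As is well-known, another motivation for fitting an autoregressive model is the
estimation of the spectral density function,

$$f(\lambda) = \sigma^2(2\pi)^{-1}\Bigl|\sum_{u=0}^{\infty} a(u)\exp(-iu\lambda)\Bigr|^{-2},$$

of $x_t$. Let

$$\hat f_k(\lambda) = \hat\sigma^2(k)(2\pi)^{-1}\Bigl|\sum_{j=0}^{k}\hat a_k(j)\exp(-ij\lambda)\Bigr|^{-2}$$

denote the autoregressive spectral estimate corresponding to the fitted kth-order
model. Shibata (1981) suggests adopting the integrated relative squared error,

$$J_T(k) = (2\pi)^{-1}\int_{-\pi}^{\pi}\bigl\{(\hat f_k(\lambda) - f(\lambda))/f(\lambda)\bigr\}^2\,d\lambda,$$

as a penalty function for $\hat f_k(\lambda)$ and determines a lower bound, $2L_T(k_T^*)/\sigma^2$, for
$J_T(k)$. Therefore, an order selection $\tilde k$ is defined as asymptotically efficient from
the point of view of autoregressive spectral estimation if

$$\operatorname*{plim}_{T\to\infty}\{J_T(\tilde k)/L_T(k_T^*)\} = 2/\sigma^2.$$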


This second definition of asymptotic efficiency of a selected order is, however,
related to that introduced at (2.7). Suppose that a selected order, $\tilde k$, is a random
variable such that

(2.8) $$\operatorname*{plim}_{T\to\infty}\{L_T(\tilde k)/L_T(k_T^*)\} = 1.$$

It follows from Shibata's results that $\tilde k$ is asymptotically efficient simultaneously
from the point of view of prediction and spectral estimation. In particular, (2.8)
holds for the order selected by the criterion
(2.9) $$S_T(k) = (N + 2k)\hat\sigma^2(k)$$
considered by Shibata (1980, 1981), and, also, for the orders selected by FPE and
AIC. We show in Section 3 that it also holds for the orders selected by CAT,
CAT*, and CAT$_2$.

We note that although a relationship between AIC(k) and CAT$_2$(k) has been
established by Bhansali (1985), this relationship may not be employed for
deducing our theorems as a direct consequence of Theorem 4.2 of Shibata (1980).
Also, the remark made in lines 7 and 8 of Shibata (1980, page 162) concerning
CAT does not apply to the criterion (2.3), but to the CAT$_\alpha$ criterion with $\alpha = 1$.


3. Asymptotic efficiency of CAT, CAT*, and CAT$_2$. For two arbitrary
random variables X and Y, we write $X \le Y$ if $\operatorname{pr}(Y - X \ge 0) = 1$.

We need the following lemma, which is an extension of Lemma 4.1 of Shibata 
(1980). 


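LEMMA 3.1. Suppose that Assumptions 1 and 2 hold. Then

$$\operatorname*{plim}_{T\to\infty} K_T\max_{1\le k\le K_T}\{|\hat\sigma^2(k) - \sigma^2|/(NL_T(k))\} = 0.$$


PROOF. We have

$$\sum_{k=1}^{K_T} E\Bigl\|N^{-1}\sum_{t=K_T+1}^{T} X_t(k)e_{t,k}\Bigr\|^2 \le \sum_{k=1}^{K_T}\sum_{l=1}^{k}\sum_{j=0}^{k}\sum_{s=0}^{k} |a_k(j)|\,|a_k(s)|$$
$$\times\,\bigl|E[\{D^{(T)}(l, j) - R(l - j)\}\{D^{(T)}(l, s) - R(l - s)\}]\bigr| \le MK_T^2/T,$$

where M denotes a bounded constant, and converges to 0 as $T \to \infty$. Hence, by
(2.2) and Lemma 3.3 of Shibata (1980),

$$\operatorname*{plim}_{T\to\infty} K_T\max_{1\le k\le K_T}\{\|\hat a(k) - a(k)\|_R^2/(NL_T(k))\} = 0.$$

The lemma therefore follows by an adaptation of the proof of Lemma 4.1 of
Shibata (1980). □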


Let $\hat k_2$ denote the order selected by minimizing the CAT$_2$ criterion (2.5). The
asymptotic efficiency of $\hat k_2$ is established below.


THEOREM 3.1. Suppose that Assumptions 1, 2, and 3 hold. Then

$$\operatorname*{plim}_{T\to\infty}\{Q_T(\hat k_2)/L_T(k_T^*)\} = 1$$

and

$$\operatorname*{plim}_{T\to\infty}\{J_T(\hat k_2)/L_T(k_T^*)\} = 2/\sigma^2.$$

PROOF. We have, for all $T \ge 1$,

(3.1) $$\{\hat\sigma^2/\hat\sigma^2(k_T^*)\}\{1 - \hat\sigma^2(k_T^*)/\hat\sigma^2(\hat k_2)\} + 2T^{-1}(\hat k_2 - k_T^*) \le 0$$

with probability 1, because $\mathrm{CAT}_2(\hat k_2) \le \mathrm{CAT}_2(k_T^*)$. Now as $T \to \infty$, $\hat\sigma^2/\hat\sigma^2(k_T^*)$
converges in probability to 1. Also as $T \to \infty$, $\hat\sigma^2(\hat k_2)$ converges in probability to
a bounded positive constant, because $D^{(T)}(0, 0) \ge \hat\sigma^2(\hat k_2) \ge \hat\sigma^2(K_T)$, where
$D^{(T)}(0, 0)$ converges in probability to $R(0)$ and $\hat\sigma^2(K_T)$ to $\sigma^2$. Therefore, (3.1)
implies that

$$\lim_{T\to\infty}\operatorname{pr}\{S_T(\hat k_2) - S_T(k_T^*) + 2k_T^*(\hat\sigma^2(k_T^*) - \hat\sigma^2(\hat k_2)) \le 0\} = 1.$$

The theorem may now be established from Lemma 3.1 and an argument similar
to that used by Shibata (1980) for proving his Theorem 4.1 by demonstrating
that

$$\operatorname*{plim}_{T\to\infty}\{L_T(\hat k_2)/L_T(k_T^*)\} = 1. \qquad □$$


Let $\hat k_C$ denote the order selected by minimizing CAT. The asymptotic efficiency
of $\hat k_C$ is established below.


THEOREM 3.2. Suppose that Assumptions 1, 2, and 3 hold. Then

$$\operatorname*{plim}_{T\to\infty}\{Q_T(\hat k_C)/L_T(k_T^*)\} = 1$$

and

$$\operatorname*{plim}_{T\to\infty}\{J_T(\hat k_C)/L_T(k_T^*)\} = 2/\sigma^2.$$

PROOF. On arguing as in the proof of Theorem 3.1, we have, because
$\mathrm{CAT}(\hat k_C) \le \mathrm{CAT}(k_T^*)$,

$$\lim_{T\to\infty}\operatorname{pr}\{[(S_T(\hat k_C) - S_T(k_T^*)) + 2k_T^*(\hat\sigma^2(k_T^*) - \hat\sigma^2(\hat k_C)) + \hat k_C(\hat\sigma^2(k_T^*) - \hat\sigma^2(\hat k_C))] \le 0\} = 1.$$

The theorem may therefore be established by demonstrating that the last result
implies

$$\operatorname*{plim}_{T\to\infty}\{L_T(\hat k_C)/L_T(k_T^*)\} = 1. \qquad □$$

Let $\hat k_{C^*}$ denote the order selected by minimizing the CAT* criterion (2.4). We
have

$$\mathrm{CAT}^*(k) = T^{-2}\sum_{j=1}^{k}(T - j)\hat\sigma^{-2}(j) - T^{-1}(T - k)\hat\sigma^{-2}(k).$$


We need the following lemmas. 


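LEMMA 3.2. Suppose that Assumptions 1 and 2 hold, and let

(3.2) $$U_{1T} = \Bigl|\hat\sigma^2(\hat k_{C^*})\hat\sigma^2(k_T^*)\sum_{j=1}^{\hat k_{C^*}}\hat\sigma^{-2}(j) - \hat k_{C^*}\hat\sigma^2(\hat k_{C^*})\Bigr|,$$

(3.3) $$U_{2T} = \Bigl|\hat\sigma^2(\hat k_{C^*})\hat\sigma^2(k_T^*)\sum_{j=1}^{k_T^*}\hat\sigma^{-2}(j) - k_T^*\hat\sigma^2(k_T^*)\Bigr|.$$

Then

(3.4) $$\operatorname*{plim}_{T\to\infty}\{U_{jT}/(NL_T(\hat k_{C^*}))\} = 0 \qquad (j = 1, 2).$$


PROOF. We need only consider (3.4) for j = 1, since the proof for j = 2 is
similar. We have

$$U_{1T}/(NL_T(\hat k_{C^*})) \le F_1 + F_2 + F_3,$$

where

$$F_1 = \{NL_T(\hat k_{C^*})\}^{-1}\hat k_{C^*}|\hat\sigma^2(k_T^*) - \sigma^2|,$$
$$F_2 = \{NL_T(\hat k_{C^*})\}^{-1}\sum_{j=1}^{\hat k_{C^*}}|\hat\sigma^2(j) - \sigma^2(j)|,$$
$$F_3 = \{NL_T(\hat k_{C^*})\}^{-1}\sum_{j=1}^{\hat k_{C^*}}|\sigma^2(j) - \sigma^2|.$$

Now

$$F_1 \le \hat k_{C^*}|\hat\sigma^2(k_T^*) - \sigma^2|/(NL_T(k_T^*)) \le K_T\max_{1\le k\le K_T}\{|\hat\sigma^2(k) - \sigma^2|/(NL_T(k))\}$$

and converges to 0 in probability as $T \to \infty$ by Lemma 3.1. Also

$$F_2 \le K_T\max_{1\le j\le K_T}\{|\hat\sigma^2(j) - \sigma^2(j)|/(NL_T(k_T^*))\}$$

and converges to 0 in probability as $T \to \infty$ by Lemma 2.1 of Shibata (1981).
Finally, consider $F_3$. Since $k_T^* \to \infty$ as $T \to \infty$ and $\{NL_T(\hat k_{C^*})\}^{-1} \le \{NL_T(k_T^*)\}^{-1} \le (k_T^*)^{-1}$,
we have, for all $T > T_0$, say,

$$F_3 \le (k_T^*)^{-1}\sum_{j=1}^{k_T^*}|\sigma^2(j) - \sigma^2|$$

and converges to 0 as $T \to \infty$ because $\sigma^2(j) \to \sigma^2$ as $j \to \infty$. □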


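LEMMA 3.3. Suppose that Assumptions 1 and 2 hold, and put
(3.5) $$U_{3T} = (\hat k_{C^*} - k_T^*)\{\hat\sigma^2(\hat k_{C^*}) + \hat\sigma^2(k_T^*)\} - 2\hat k_{C^*}\hat\sigma^2(\hat k_{C^*}) + 2k_T^*\hat\sigma^2(k_T^*).$$
Then

$$\operatorname*{plim}_{T\to\infty}\{|U_{3T}|/(NL_T(\hat k_{C^*}))\} = 0.$$


PROOF. We have

$$|U_{3T}|/(NL_T(\hat k_{C^*})) \le 4K_T\max_{1\le k\le K_T}\{|\hat\sigma^2(k) - \sigma^2|/(NL_T(k))\}$$

and converges to 0 in probability as $T \to \infty$ by Lemma 3.1. □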


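LEMMA 3.4. Suppose that Assumptions 1 and 2 hold and let

(3.6) $$U_{4T} = T^{-1}\hat\sigma^2(\hat k_{C^*})\hat\sigma^2(k_T^*)\sum_{j=1}^{\hat k_{C^*}} j\hat\sigma^{-2}(j)$$

and

(3.7) $$U_{5T} = T^{-1}\hat\sigma^2(\hat k_{C^*})\hat\sigma^2(k_T^*)\sum_{j=1}^{k_T^*} j\hat\sigma^{-2}(j).$$

Then

$$\operatorname*{plim}_{T\to\infty}\{U_{jT}/(NL_T(\hat k_{C^*}))\} = 0 \qquad (j = 4, 5).$$

PROOF. The lemma follows directly from Lemma 3.2, by noting that
$K_T/T \to 0$ as $T \to \infty$ and $1 \le \hat k_{C^*}, k_T^* \le K_T$. □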

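THEOREM 3.3. Suppose that Assumptions 1 and 2 hold. Then

$$\operatorname*{plim}_{T\to\infty}\{Q_T(\hat k_{C^*})/L_T(k_T^*)\} = 1$$

and

$$\operatorname*{plim}_{T\to\infty}\{J_T(\hat k_{C^*})/L_T(k_T^*)\} = 2/\sigma^2.$$

PROOF. On arguing as in the proof of Theorem 3.1, we have, because
$\mathrm{CAT}^*(\hat k_{C^*}) \le \mathrm{CAT}^*(k_T^*)$,

$$\lim_{T\to\infty}\operatorname{pr}\{S_T(\hat k_{C^*}) - S_T(k_T^*) + U_T \le 0\} = 1,$$

where

$$U_T = U_{1T} - U_{2T} + U_{3T} + U_{4T} - U_{5T}$$

and $U_{jT}$ $(j = 1, \ldots, 5)$ are as in Lemmas 3.2-3.4. The theorem may therefore be
established from these lemmas by demonstrating that the last result implies

$$\operatorname*{plim}_{T\to\infty}\{L_T(\hat k_{C^*})/L_T(k_T^*)\} = 1. \qquad □$$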


4. Discussion. As in Shibata (1980), let $\hat k^{(\alpha)}$ be the order selected by
minimizing the criterion
(4.1) $$S_T^{(\alpha)}(k) = (N + \alpha k)\hat\sigma^2(k) \qquad (1 \le k \le K_T),$$
where $\alpha > 0$ is an arbitrary constant. It follows from Theorem 3.1 that the order
selected by minimising CAT is asymptotically equivalent to $\hat k^{(2)}$ rather than to
$\hat k^{(1)}$ as suggested by Shibata (1980).

For any fixed $\alpha$, $\hat k^{(\alpha)}$ and $\hat k_\alpha$, the order selected by minimizing the CAT$_\alpha$
criterion, are asymptotically equivalent. Hence, by repeating the arguments of
Shibata, it follows that if $\alpha \ne 2$ and, as $k \to \infty$, $\{\sigma^2(k) - \sigma^2\} \to 0$ geometrically
then $\hat k_\alpha$ is not asymptotically efficient in the sense defined in Section 2. However,
$\hat k_\alpha$ is still asymptotically efficient if, as $k \to \infty$, $\sigma^2(k) - \sigma^2$ goes down to 0
exponentially, which is the case if $f(\lambda)$ coincides in $-\pi \le \lambda \le \pi$ almost everywhere
with a function that is analytic for real $\lambda$ and has no real zeroes; see
Grenander and Szegő (1958). In particular, if $x_t$ is a Gaussian autoregressive
moving average process of order (p, q), with q > 0, to ensure that it does not
degenerate to a finite autoregression, then $\sigma^2(k) \to \sigma^2$ exponentially as $k \to \infty$
and Assumptions 1 and 3 hold. Thus, for this important class of processes, $\hat k_\alpha$
is still asymptotically efficient, in the sense defined in Section 2, for any $\alpha > 1$.

The order of decrease of $\sigma^2(k)$ to $\sigma^2$ is of course usually unknown. However,
whether $x_t$ is an autoregressive process of infinite, or finite, order is also usually
unknown. If the order is finite then a choice of $\alpha = 2$ is not necessarily optimal
because it leads to an inconsistent estimator of the order, and, also, because, with
a finite T, no one choice of $\alpha$ is always optimal for all processes and all values of
T; see Bhansali (1985).

We finally note that if instead of (2.6), a generalized penalty function

$$Q_T^{(\alpha')}(k) = \|a(k) - a\|_R^2 + \alpha'\|\hat a(k) - a(k)\|_R^2$$

is defined, where $\alpha' > 0$ is an arbitrary constant, then the arguments given in
Section 3 show that for $\alpha = \alpha' + 1$ the order selected by minimising the CAT$_\alpha$
criterion is asymptotically efficient with respect to this generalized penalty
function, and so are the orders selected by minimising the FPE$_\alpha$ and AIC$_\alpha$ criteria
of Bhansali and Downham (1977) and Akaike (1979). A motivation for considering
this generalized penalty function is that by varying $\alpha'$ the two terms on the
right of (2.6) may be given unequal weights; see Bhansali (1985) for a discussion
of the reasons for considering this possibility. The question of how best to choose
$\alpha$ is considered by several authors; see, for example, Atkinson (1980), Smith and
Spiegelhalter (1980), Akaike (1979), Shibata (1983), and Bhansali (1979).


The size of error is investigated when the log-likelihood of multivariate
generalized linear models is approximated by a quadratic function. The
nonquadratic tail is characterized by analyzing the cubic part of the log-likelihood.
In a local analysis simple bounds for that part can be expressed in terms
of expectations of the related random variables for arbitrary sample size N.
Additionally global error bounds are given for the univariate case. 


Introduction. In many applications the log-likelihood is approximated by a 
quadratic function. An adequate approximation and the corresponding asymp- 
totic theory ensure that the classical linear regression results apply to nonnormal 
models as well; cf. Lawless and Singhal (1978). The adequacy depends on the 
parametrization (cf. Section 2) and on the size of the global approximation error, 
for which until now no bounds have been published. 

However, the investigations of Minkin (1983) and of this paper may be viewed
as steps in this direction. Both articles assess local errors in the main. Minkin
(1983) considers independent observations $y_{(i)}$ of a univariate one parameter
family and the linear relationship $\theta_{(i)} = x_{(i)}^T\beta$. He gives relative and absolute
bounds for the approximation error in a region $R_a$ characterized by a quadratic
form using the information matrix

$$Q(\beta, \hat\beta) = (\beta - \hat\beta)^T J(\hat\beta)(\beta - \hat\beta),$$

namely

$$R_a = \{\beta \mid Q(\beta, \hat\beta) < a\}.$$


Minkin’s bounds depend on the analysis of more or less complicated remainder 
functions. The paper treats logit and log-linear models explicitly, which are 
special cases of univariate generalized linear models (GLM’s); cf. Nelder and 
Wedderburn (1972). 

In this article global error bounds for logit, log-linear, and models depending 
on the gamma distribution are presented. The main subject, however, is a local 
analysis of the relative quadratic approximation error in multivariate canonical 
GLM’s. Hence we cover the important multinomial case. 

In Section 1 the third derivative structure of an exponential family with 
canonical parametrization is analyzed. A general technique for obtaining bounds, 
applicable to all distributions of this kind, is derived in Section 2 and explicit
results for the Poisson, gamma, and multinomial distribution follow. In any case
it turns out that the size of the cubic terms compared with that of the quadratic 
one is of the same order as the relative deviations in the expectations of the 
related random variables. 

In the general multivariate case only local error bounds are available. Never- 
theless they reveal what really happens in the nonquadratic tail. In Section 3 this 
can be seen by comparing global and local errors for the univariate distributions 
mentioned above. The results can be used in Section 4 to obtain a bound for the
cubic error terms in the quadratic approximation of likelihood ratio (LR) statistics
in GLM's which is valid for arbitrary sample size N.

Finally it is shown how the results can be used to test linear hypotheses in 
common GLM applications, like log-linear models or contingency tables. A special 
multinomial regression model with continuous factors, namely the logistic dis- 
criminant approach [cf. Anderson (1972)], completes the last section. 


1. Third order properties of exponential family densities. Before we 
can treat the subject of our paper we summarize some properties of exponential 
families in canonical parametrization and the multinomial distribution which are 
needed later. 

Consider a discrete or continuous random variable $Y = (Y_j) \in R^q$ having an
exponential family density

(1) $$f(y \mid \theta) = c(y)\exp\{y^T\theta - b(\theta)\}$$

with respect to the counting or Lebesgue measure. The canonical parameter
$\theta = (\theta_j) \in R^q$ is not involved in c. The partial derivatives of b, $b_j$,
$b_{jk}, b_{jkl}, \ldots$, generate the cumulants of Y, with

(2) $$\mu = (\mu_j) = E(Y) = (b_j), \qquad \Sigma = (\sigma_{jk}) = (\operatorname{cov}(Y_j, Y_k)) = (b_{jk}), \quad \text{etc.}$$

According to Barndorff-Nielsen (1978) the correspondence between $\theta$ and $\mu$ is one
to one and there is a scalar function $g(\mu)$, namely the Legendre transform of b,
with partial derivatives $g_r, g_{rs}, g_{rst}, \ldots$ such that

(3) $$\theta_r = g_r(\mu), \qquad \Sigma^{-1} = (g_{rs}(\mu)),$$

and

(4) $$b_{jkl} = -\sum_{r,s,t} g_{rst}(\mu)\sigma_{rj}\sigma_{sk}\sigma_{tl}, \qquad j, k, l, r, s, t = 1, \ldots, q.$$


The partial derivatives of $g(\cdot)$ appear in related calculations by McCullagh
(1984). If we regard the third derivative tensors $B = (b_{jkl})$, $G = (g_{rst})$ as operators
with three arbitrary arguments u, v, w and $\tilde u = \Sigma u$, $\tilde v = \Sigma v$, and $\tilde w = \Sigma w \in R^q$,
respectively, we obtain from (4)

(5) $$B[u, v, w] = \sum_{j,k,l} b_{jkl}u_jv_kw_l = -\sum_{r,s,t} g_{rst}\tilde u_r\tilde v_s\tilde w_t = -G[\Sigma u, \Sigma v, \Sigma w].$$

Finally we give the expressions above explicitly for the multinomial case, because
these relations are needed in the proofs of the subsequent section. There the
factorization (5) and the simple structure of G play a fundamental role.

Since the multinomial $Y = (Y_0, Y_1, \ldots, Y_q)^T$ with $E(Y) = (\pi_0, \pi_1, \ldots, \pi_q)^T$,
$\pi_0 + \cdots + \pi_q = 1$, has a singular distribution, $Y_0$ does not occur in (1). We obtain
$b(\theta) = \ln(1 + e^{\theta_1} + \cdots + e^{\theta_q})$ and with the Kronecker delta $\delta$,

(6) $$\Sigma^{-1} = (g_{rs}) = \Bigl(\frac{1}{\pi_0} + \frac{\delta_{rs}}{\pi_r}\Bigr),$$

(7) $$G = (g_{rst}) = \Bigl(\frac{1}{\pi_0^2} - \frac{\delta_{rst}}{\pi_r^2}\Bigr), \qquad r, s, t = 1, \ldots, q,$$

and hence with $\tilde u, \tilde v, \tilde w$ from (5)

(8) $$B[u, v, w] = \tilde u^T(\tau_{rs}\{\pi, \tilde v\})\tilde w,$$

where

$$\tau_{rs}\{\pi, \tilde v\} = \delta_{rs}\frac{\tilde v_r}{\pi_r^2} - \frac{1}{\pi_0^2}\sum_{t}\tilde v_t \qquad\text{and}\qquad \tilde v = (\tilde v_1, \ldots, \tilde v_q)^T.$$


r t 


2. Local error bounds in quadratic log-likelihood approximation. For 
normally distributed random variables with known variance the log-likelihood is 
quadratic in the mean value parameter. Outside of this special case
(9) $$L(\theta) = \ln f(y \mid \theta)$$
is often approximated by some quadratic function. The adequacy depends heavily
on the parametrization used. Here we investigate quadratic Taylor polynomials
for the canonical parameter $\theta$. In this case derivatives of order greater than 1 do
not involve the random variable y. Hence the bounds given in Section 4 for 
GLM’s are valid for every sample size N (not only asymptotically). Here we 
present our results concerning the structure of the cubic terms in exponential 
family log-likelihoods. Global bounds are to be found in the subsequent section. 

$L(\theta)$ is expanded about some fixed $\theta^*$ and $\mu$, $\mu^*$ denote the corresponding
expectations. It turns out that relative errors in the components of $\mu$ and $\mu^*$ are
responsible for an inadequate quadratic approximation, i.e.,

$$\rho_3 = O\Bigl(\max_j\Bigl|\frac{\mu_j - \mu_j^*}{\mu_j^*}\Bigr|\Bigr).$$

With the quadratic form

(10) $$Q(\theta, \theta^*) = (\theta - \theta^*)^T\Sigma\{\theta^*\}(\theta - \theta^*),$$

the Taylor expansion of $L(\theta)$ is

(11) $$L(\theta) - L(\theta^*) = (y - \mu^*)^T(\theta - \theta^*) - \tfrac{1}{2}Q(\theta, \theta^*)\bigl(1 + \tfrac{1}{3}\rho_3(\theta, \theta^*)\bigr) + R_4(\theta, \theta^*),$$

where $-\rho_3Q/6$ and $R_4$ describe the terms of order 3 and 4, respectively.


2.1. The univariate case. For simplicity we first present the results for
univariate Y, i.e., q = 1. Denote by $b^{(j)}$ and $g^{(j)}$ the j-order derivatives of b and
g, respectively (e.g., $g^{(3)} = g_{111}$), and let $\sigma_*^2 = b^{(2)}(\theta^*) = \sigma_{11}(\theta^*)$. Now (11) can be
written as

(12) $$L(\theta) - L(\theta^*) = (y - \mu^*)(\theta - \theta^*) - \tfrac{1}{2}\sigma_*^2(\theta - \theta^*)^2\bigl[1 + \tfrac{1}{3}b^{(3)}(\theta^*)(\theta - \theta^*)/\sigma_*^2\bigr] + R_4(\theta, \theta^*).$$
From the results of Section 1 we deduce

(13) $$\theta - \theta^* = \frac{\mu - \mu^*}{\sigma_*^2} + O(\mu - \mu^*)^2.$$

Together with (5) this shows that $\rho_3$ and
(14) $$\rho_3^* = -g^{(3)}(\mu^*)\sigma_*^2(\mu - \mu^*)$$
differ only by terms of order 2. Hence we can replace $\rho_3$ by $\rho_3^*$ in (11) and get a
new remainder $R_4^*$ which is of the same order as the original $R_4$ (as $\rho_3^*$ is
multiplied by Q). Using the notation of (3) we obtain

(15) $$\theta = \ln(\mu) = g_1(\mu); \qquad g^{(3)}(\mu) = -\mu^{-2}; \qquad \sigma^2 = \mu = b^{(2)}(\theta)$$

for the Poisson and for $\Gamma(\nu, \gamma)$ gamma variables:

(16) $$\theta = -\gamma = -\nu/\mu = g_1(\mu); \qquad g^{(3)}(\mu) = -2\nu\mu^{-3}; \qquad \sigma^2 = \mu^2/\nu = b^{(2)}(\theta).$$

This leads to our first crucial result:


THEOREM 2.1. Size of $\rho_3^*$ for Poisson and gamma distribution. The relative
size of the cubic terms compared with the quadratic ones in (11) is

(17) $$|\rho_3^*| = C\,\Bigl|\frac{\mu - \mu^*}{\mu^*}\Bigr|,$$

where $C_{\text{Poisson}} = 1$, $C_{\text{gamma}} = 2$.
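A small numerical check of the relation between $\rho_3$ in (12) and its approximation $\rho_3^*$ in (14) can be sketched as follows. The sketch assumes $b(\theta) = e^\theta$ for the Poisson and $b(\theta) = -\nu\ln(-\theta)$ for the gamma family, as implied by (15) and (16); the function names and the numerical values are hypothetical.

```python
import math


def rho3_poisson(mu, mu_star):
    # b(theta) = exp(theta): b'''/b'' = 1, so rho_3 = theta - theta*
    return math.log(mu) - math.log(mu_star)


def rho3_star_poisson(mu, mu_star):
    # (14) with g'''(mu) = -mu^{-2} and sigma*^2 = mu*
    return (mu - mu_star) / mu_star


def rho3_gamma(mu, mu_star, nu):
    # b(theta) = -nu*log(-theta): b'''/b'' = -2/theta, with theta = -nu/mu
    theta, theta_star = -nu / mu, -nu / mu_star
    return (-2.0 / theta_star) * (theta - theta_star)


def rho3_star_gamma(mu, mu_star):
    # (14) with g'''(mu) = -2*nu*mu^{-3} and sigma*^2 = mu*^2/nu
    return 2.0 * (mu - mu_star) / mu_star


mu_star, mu, nu = 10.0, 10.5, 3.0
rel = abs(mu - mu_star) / mu_star  # relative deviation of the expectations
gap_poisson = abs(rho3_poisson(mu, mu_star) - rho3_star_poisson(mu, mu_star))
gap_gamma = abs(rho3_gamma(mu, mu_star, nu) - rho3_star_gamma(mu, mu_star))
```

With a 5% relative deviation in the mean, both gaps are of the order of the squared relative deviation, which illustrates that $\rho_3$ and $\rho_3^*$ differ only by second order terms.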


2.2. The multinomial case. In the multivariate case $\rho_3$ of (11) can be replaced
again by $\rho_3^*$. After this we can formulate the following theorem for the quadratic
approximation error of the log-likelihood:


THEOREM 2.2. Size of $\rho_3^*$ in the binomial and multinomial case. Let $\pi_0, \ldots, \pi_q$
be as given in Section 1; then for q = 1 (binomial) we obtain

(18) $$|\rho_3^*| = \Bigl|\frac{\pi_1 - \pi_1^*}{\pi_1^*} + \frac{\pi_0 - \pi_0^*}{\pi_0^*}\Bigr|,$$

whereas for q > 1

(19) $$|\rho_3^*| \le 2\sqrt{q + 1}\max_{0\le m\le q}\Bigl|\frac{\pi_m - \pi_m^*}{\pi_m^*}\Bigr|.$$

The bound in (18) is slightly stronger than (19) for q = 1. The additional factor
$(q + 1)^{1/2}$ is due to Lemma 2.5 (formulation to follow).


Before proving (18)-(19) we explain how these results can be used in practical
applications.


REMARK 2.3. Size of quadratic approximation error. From the theorems we
conclude that the relative size of the cubic terms is small if the leading digits of
the expectations $\mu$ and $\mu^*$ coincide. In Section 3 we show that small $\rho_3$ yields a
small global approximation error $R_4$ under very general conditions. Hence the
approximation error is given in the main by the simple formulas (17)-(19).
Conversely, if the leading digits of $\mu$ and $\mu^*$ differ, the quadratic approximation
may be bad. So we have derived a simple criterion for the adequacy of quadratic
log-likelihood approximations of exponential families (and as we see in Section 4,
of GLM's), which can be easily checked during a computational procedure.


PROOF OF THEOREM 2.2. For the special binomial case we obtain with (7)
and $\pi_0 = 1 - \pi_1$

(20) $$\rho_3^* = \frac{\pi_1 - \pi_1^*}{\pi_1^*} - \frac{\pi_1 - \pi_1^*}{\pi_0^*},$$

which is the same as (18). □
Since the proof for multinomial Y is tedious only the main steps are sketched.


OUTLINE OF THE PROOF FOR MULTINOMIAL Y. Let $\Sigma := \Sigma\{\theta^*\}$ and $T\{\cdot\} =
(\tau_{rs}\{\pi^*, \cdot\})$ which are positive definite or symmetric, respectively, and further
$\pi = (\pi_1, \ldots, \pi_q)^T$ and $e = (1, \ldots, 1)^T$. We need two linear algebra lemmas which
are stated without proof:


LEMMA 2.4. Factorization of the spectral radius r of a symmetric product.
For arbitrary $u \in R^q$
(21) $$u^T\Sigma T\Sigma u \le r[\Sigma^{1/2}T\Sigma^{1/2}]\,u^T\Sigma u.$$

LEMMA 2.5. Bounds for spectral norms.
(22) $$\|I - \pi e^T\|_2 \le \sqrt{q + 1}, \qquad \|\pi e^T\|_2 \le \sqrt{q}.$$

Now the three steps of the multivariate proof are as follows:
(i) Let $u = v = w := \theta - \theta^*$. Then (8) and Lemma 2.4 yield:
(23) $$|B[w, w, w]| = |w^T\Sigma T\{\Sigma w\}\Sigma w| \le r[\Sigma^{1/2}T\{\Sigma w\}\Sigma^{1/2}]\,Q(\theta, \theta^*) = r[\Sigma T\{\Sigma(\theta - \theta^*)\}]\,Q(\theta, \theta^*).$$
(ii) Analogously to (13) we expand $w = \theta - \theta^*$ and obtain
(24) $$\Sigma(\theta - \theta^*) = \pi - \pi^* + R_2,$$
where $R_2$ again is multiplied by Q and hence is of order 4.
(iii) Define $\rho_m = (\pi_m - \pi_m^*)/\pi_m^*$, $m = 0, \ldots, q$; $D\{\rho\} = \operatorname{diag}\{\rho_1, \ldots, \rho_q\}$; then
(6) and (8) yield in matrix notation

$$r[\Sigma T\{\pi - \pi^*\}] \le \|(I - \pi e^T)D\{\rho\}\|_2 + \|\pi e^T\|_2|\rho_0| \le 2\sqrt{q + 1}\max_m|\rho_m|,$$

where we have used $r[A] \le \|A\|_2$ and applied Lemma 2.5. □
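The three steps can be checked numerically for a small multinomial example. The sketch below assumes the structure $\Sigma = \operatorname{diag}(\pi^*) - \pi^*\pi^{*T}$ and the matrix T with entries $\delta_{rs}\tilde v_r/\pi_r^{*2} - \pi_0^{*-2}\sum_t\tilde v_t$ evaluated at $\tilde v = \pi - \pi^*$; the function name and the probabilities are hypothetical and chosen only for illustration.

```python
import math


def cubic_ratio_and_bound(pi, pi_star):
    """pi, pi_star hold the probabilities of categories 1..q; category 0 is the remainder."""
    q = len(pi_star)
    p0, p0s = 1.0 - sum(pi), 1.0 - sum(pi_star)
    # w = theta - theta*, with canonical parameters theta_r = log(pi_r / pi_0)
    w = [math.log(pi[r] / p0) - math.log(pi_star[r] / p0s) for r in range(q)]
    # Sigma = diag(pi*) - pi* pi*^T
    Sigma = [[(pi_star[r] if r == s else 0.0) - pi_star[r] * pi_star[s]
              for s in range(q)] for r in range(q)]
    Sw = [sum(Sigma[r][s] * w[s] for s in range(q)) for r in range(q)]
    Q = sum(w[r] * Sw[r] for r in range(q))  # quadratic form (10)
    # T evaluated at pi - pi* (step (ii) replaces Sigma*w by pi - pi*)
    v = [pi[r] - pi_star[r] for r in range(q)]
    v0 = -sum(v)
    T = [[v0 / p0s ** 2 + (v[r] / pi_star[r] ** 2 if r == s else 0.0)
          for s in range(q)] for r in range(q)]
    cubic = sum(Sw[r] * T[r][s] * Sw[s] for r in range(q) for s in range(q))
    # relative deviations rho_m, m = 0..q, and the bound of (19)
    rho = [(p0 - p0s) / p0s] + [v[r] / pi_star[r] for r in range(q)]
    bound = 2.0 * math.sqrt(q + 1) * max(abs(x) for x in rho)
    return abs(cubic) / Q, bound


ratio, bound = cubic_ratio_and_bound([0.25, 0.30], [0.20, 0.30])
```

In this example the ratio of the cubic form to Q stays well below the bound, since the largest relative deviation of the probabilities is 25%.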


REMARK 2.6. The techniques described can be used in other cases as well,
because steps (i) and (ii) work for any exponential family. Step (iii) and Lemma
2.5 apply to the class of "multinomial like" distributions with a $\Sigma$- and G-structure
analogous to (6)-(8), e.g., to the negative multinomial and the logarithmic
series distribution [cf. Johnson and Kotz (1969)].


3. Global error bounds depending on the size of $\rho_3$. In this section the
connection between the global approximation error $R_4$ in (12) and the relative
error $\rho_3$ of cubic terms compared with quadratic terms is analyzed. For the
Poisson and gamma distribution small $\rho_3$ causes a small $R_4$. This is not true for
the binomial case. Here $\rho_3 = 0$ is possible, whereas the global quadratic
approximation error can be quite large (cf. Section 3.2). Note that the bounds of the
previous section hold for $\rho_3^*$. But according to (14), $\rho_3$ and $\rho_3^*$ differ only by terms
of order 2.


3.1. The principle. We use the notation of Section 2.1. For an arbitrary
univariate exponential family the following approach works.
Starting from equation

(25) $$\rho_3 = (\theta - \theta^*)b^{(3)}(\theta^*)/b^{(2)}(\theta^*)$$

the error term of (12) is

(26) $$R_4(\theta, \theta^*) = -\frac{\rho_3^4}{4!}\Bigl[\frac{b^{(2)}(\theta^*)}{b^{(3)}(\theta^*)}\Bigr]^4 b^{(4)}(\tilde\theta),$$

where $\tilde\theta$ is some convex combination of $\theta$ and $\theta^*$.
Hence we get the following:


THEOREM 3.1 (Global approximation error). The remainder $R_4$ of the Taylor
expansion (12) can be bounded by

$$|R_4| \le \rho_3^4M/4!,$$

where

(27) $$M_{\text{Poisson}} = \max\{e^{\theta}, e^{\theta^*}\},$$

(28) $$M_{\text{gamma}} = \frac{3\nu}{8}\max\Bigl\{1, \frac{\theta^{*4}}{\theta^4}\Bigr\},$$

(29) $$M_{\text{binomial}} = \frac{(1 + e^{\theta^*})^4}{(1 - e^{\theta^*})^4}, \qquad \theta^* \ne 0.$$


PROOF. By inserting the following quantities into (26):

                $b^{(2)}(\theta^*)/b^{(3)}(\theta^*)$                    $b^{(4)}(\tilde\theta)$
Poisson         $1$                                                       $e^{\tilde\theta}$
gamma           $-\theta^*/2$                                             $6\nu/\tilde\theta^4$
binomial        $(1 + e^{\theta^*})/(1 - e^{\theta^*})$                   cf. Section 3.2

(27) and (28) are valid for arbitrary $\theta$, $\theta^*$, whereas the binomial bound approaches
infinity for $\theta^* \to 0$. This case is investigated in the following section. □


3.2. Global bounds for the binomial distribution. The second order Taylor
expansion of the binomial log-likelihood is

(30) $$L(\theta) - L(\theta^*) = (y - \mu^*)(\theta - \theta^*) - \tfrac{1}{2}\sigma_*^2(\theta - \theta^*)^2 + R(\theta, \theta^*),$$

where $\sigma_*^2 = e^{\theta^*}/(1 + e^{\theta^*})^2$. Since $b^{(3)}(\theta) = e^{\theta}(1 - e^{\theta})/(1 + e^{\theta})^3$ and
$b^{(4)}(\theta) = e^{\theta}(1 - 4e^{\theta} + e^{2\theta})/(1 + e^{\theta})^4$ are uniformly bounded by 1 we have:


THEOREM 3.2 (Global quadratic approximation error of the binomial). The
remainder of the log-likelihood Taylor expansion (30) is bounded by

(31) $$|R(\theta, \theta^*)| \le |\theta - \theta^*|^3/3!$$

and for $\theta^* = 0$ we obtain

(32) $$|R(\theta, \theta^*)| \le \theta^4/4!.$$
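The two bounds can be checked directly, because the remainder in (30) does not depend on the observation y. The following sketch assumes $b(\theta) = \ln(1 + e^{\theta})$ for a single binomial trial; the function names and the parameter values are hypothetical.

```python
import math


def b(theta):
    """Cumulant function of the binomial with one trial: log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))


def remainder(theta, theta_star):
    """R(theta, theta*) of (30); the term (y - mu*)(theta - theta*) cancels, so y is not needed."""
    mu_star = 1.0 / (1.0 + math.exp(-theta_star))
    var_star = mu_star * (1.0 - mu_star)
    d = theta - theta_star
    return -(b(theta) - b(theta_star) - mu_star * d - 0.5 * var_star * d * d)


# Bound (31) for a general expansion point, bound (32) for theta* = 0.
r_general = remainder(1.0, 0.3)
bound_general = abs(1.0 - 0.3) ** 3 / math.factorial(3)
r_zero = remainder(1.5, 0.0)
bound_zero = 1.5 ** 4 / math.factorial(4)
```

Both remainders come out far smaller than the bounds, which is consistent with the derivatives of b being bounded well below 1 in absolute value.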


3.3. Interpretation of the bounds. Working with quadratic log-likelihood 
approximation we are interested in how large the neglected terms are compared 
with the approximating ones. This problem is solved exactly only for uni- 
variate Y. 

Except for the binomial case with $\rho_3 = 0$, which has been analyzed separately, for
Poisson, gamma, and binomial random variables small $\rho_3$ yields a small global
approximation error. This justifies the practical relevance of $\rho_3$ as a measure for
the adequacy of the quadratic approximation. Note that the roles of $\theta$ and $\theta^*$ in
(11) and (12) may be interchanged which possibly improves the bounds (27)-(29),
e.g., $M_{\text{gamma}} = \tfrac{3}{8}\nu$.
An explicit analysis for the multinomial case seems too hard because of the
difficult multivariate structure.

In the last chapter we demonstrate how the results can be applied to model 
selection in canonical generalized linear models. 


4. LR-statistic approximation and model selection in GLM's. The theorems
of Section 2 refer to the log-likelihood L itself and not to any expectation of
L or its derivatives. In the following the results are extended in a natural way to
generalized linear models (GLM's). Hence we obtain computationally available
bounds for the quadratic approximation error of the corresponding LR-statistic which
are valid for any sample size N. An application of hypothesis testing in logistic
discriminant analysis completes this section.

Consider independent exponential family response variables $y_{(i)} \in R^q$ characterized
by parameters $\theta_{(i)} \in R^q$ and linked with regressor variables $x_{(i)} \in R^p$,
$i = 1, \ldots, N$. According to (1) the common log-likelihood (here denoted by $L_N$) is
given by

(33) $$L_N(\theta) = \text{const.} + \sum_{i=1}^{N}\{y_{(i)}^T\theta_{(i)} - b(\theta_{(i)})\},$$

where $\theta$ is the collection of the $\theta_{(i)}$.

Although our approach covers the multinomial case, too, we restrict ourselves
to q = 1, for simplicity, and consider GLM's, where the unknown parameters
$\theta_{(i)} \in R$ range in a linear subspace, spanned by the $x_{(i)}$, i.e.,

(34) $$\theta = X\beta.$$

$X = (x_{(1)} \cdots x_{(N)})^T$ denotes the usual design matrix and $\beta \in R^p$ is an unknown
parameter vector.

In the notation of Nelder and Wedderburn (1972) the model described by (33)
and (34) is a canonical GLM, because the $\theta_{(i)}$ themselves, and not some functions
of them, range in a linear subspace.

Among several authors McCullagh and Nelder (1983) expose that a variety of 
classical applications, like log-linear models, contingency tables, and binary and 
survival data analysis fit into the GLM framework. Furthermore, according to 
Fahrmeir and Kredler (1984), the logistic discrimination approach, which seems 
to be superior to the widely used linear Fisher discrimination, [cf. Press and 
Wilson (1978)], leads formally to a multinomial GLM. 

Model selection and checking the goodness of fit in the models mentioned
above usually is carried out by tests of hypotheses H concerning the parameters
$\theta$ or $\beta$, respectively. Denote by $\hat\theta_H = X\hat\beta_H$ and $\hat\theta = X\hat\beta$ the maximizer of $L_N$
under and without restriction H. As $\nabla L_N(\hat\theta) = 0$ the Taylor expansion of the
LR-statistic is

(35) $$\lambda = -2\{L_N(\hat\theta_H) - L_N(\hat\theta)\} = Q_N(\hat\theta_H, \hat\theta)(1 + \rho_N^*/3) - 2R_N^*,$$

where $Q_N$, $\rho_N^*$, and $R_N^*$ denote the quantities (associated with the total sample
size N) corresponding to Q, $\rho_3^*$, and $R_4^*$ of Section 2.
Applying the theorems of Section 2 yields directly: 


THEOREM 4.1 (Size of error in quadratic LR-statistic approximation). For
Poisson, gamma, and multinomial response variables we obtain


(36) $|\rho_N^*| \le 2\sqrt{q+1}\,\max_i \max_{0 \le m \le q} \left| \dfrac{\hat\mu_{im}^H - \hat\mu_{im}}{\hat\mu_{im}^H} \right|,$


where $\hat\mu_{(i)}^H$ ($\hat\mu_{i0} = 1 - \hat\mu_{i1} - \cdots - \hat\mu_{iq}$ for multinomial $y_{(i)}$) and $\hat\mu_{(i)}$ denote
the estimated expectations under and without restriction $H$. According to (36)
the LR-statistic can be replaced by its quadratic Taylor polynomial, provided
the relative differences between $\hat\mu^H$ and $\hat\mu$ are small compared with 1, i.e., the leading
figures coincide. For the test of linear hypotheses

$H$: $A\beta = a$
in the models mentioned above we can proceed as follows:


(i) Compute $\hat\theta$, $L_N(\hat\theta)$, and $-\nabla^2 L_N(\hat\theta)$.

(ii) Compute approximations for $\hat\theta_H$, $\hat\mu^H$, and the LR-statistic $\lambda$ in a suitable
linear model with cross-product matrix $-\nabla^2 L_N(\hat\theta)$.

(iii) Accept the quadratic approximations of (ii) if the leading digits of $\hat\mu^H$
and $\hat\mu$ coincide.

Usually during the process of model building a variety of tests of nested linear
hypotheses, e.g., concerning the significance of certain parameters, are carried
out. For a sequence of tests the same linear model can be used as long as
condition (iii) is fulfilled. If (iii) is violated we take a new adjusted cross-product
matrix instead of $-\nabla^2 L_N(\hat\theta)$. For example, in logistic discriminant analysis [cf.
Anderson (1972)], the a posteriori probability for the classification of observation
$(i)$ in group $\Omega_k$, $k = 0, 1, \ldots, q$, is given by


$\pi(x_{(i)}, \beta_k) = \exp(x_{(i)}^T \beta_k) \Big/ \sum_{m=0}^{q} \exp(x_{(i)}^T \beta_m), \qquad \beta_0 = 0.$
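
The a posteriori probabilities above can be sketched numerically as follows. This is only an illustration: the function name `posterior_probabilities` and the example numbers are hypothetical, and the reference group 0 is given the parameter vector 0 as in the formula.

```python
import math

def posterior_probabilities(x, betas):
    """Return [pi_0, ..., pi_q] for one observation.

    x     : list of explanatory values (x[0] = 1 for the intercept)
    betas : list of q parameter vectors beta_1..beta_q; beta_0 = 0 is implied
    """
    # Linear predictors; group 0 is the reference group with beta_0 = 0.
    etas = [0.0] + [sum(xj * bj for xj, bj in zip(x, beta)) for beta in betas]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(etas)
    weights = [math.exp(eta - shift) for eta in etas]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example with q = 2 groups besides the reference group.
probs = posterior_probabilities([1.0, 0.5], [[0.2, -1.0], [0.0, 0.0]])
uniform = posterior_probabilities([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]])
```

With all parameter vectors equal to 0 every group receives probability 1/(q + 1), which is a quick sanity check of the normalization.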


This leads to a canonical multinomial GLM. We set $x_{i1} = 1$ and start with the
explanatory variables $x_{i2}, \ldots, x_{ip}$. In order to check which variables are
significant for discrimination, we test nested hypotheses like

$H_j$: $\beta_p = \beta_{p-1} = \cdots = \beta_j = 0, \qquad j = p, \ldots, 2.$


The corresponding values of the log-likelihood $L_N(\hat\theta_{H_j})$ then can be computed in
the following way:


$L_N(\hat\theta)$ {exact}, $L_N(\hat\theta_{H_p}), \ldots, L_N(\hat\theta_{H_{k+1}})$ {quadratic approximation},


update: $L_N(\hat\theta_{H_k})$ {exact}, $L_N(\hat\theta_{H_{k-1}}), \ldots$ {quadratic approximation}.


For variable selection problems this yields an effective and fast procedure [cf.
Kredler (1984)].

5. Conclusions. The analysis of the LR-statistic for multinomial, Poisson,
and gamma samples, given in this paper, applies in an analogous way to related
distributions like the negative multinomial and logarithmic series. This paper
gives an algebraic analysis of the quadratic log-likelihood approximation. The
analysis depends heavily on the use of the canonical parametrization of exponential
families, since only in this case do higher derivatives of the log-likelihood not
depend on the random variable Y. Consequently the error bounds for canonical
GLM's given in the last section are valid for arbitrary sample size (not only
asymptotically). Since the analytic dependencies between canonical and expectation
parameters play a crucial role, our results cannot be easily extended to other
parametrizations. Nevertheless our analysis works for many important applications.

Since their introduction by Dwass (1957) and Barnard (1963), Monte
Carlo tests have attracted considerable attention. The aim of this paper is to
give a unified approach that covers the case of an arbitrary null distribution
in order to study the statistical properties of Monte Carlo tests under the null
hypothesis and under the alternative. For finite samples we obtain bounds for
the power of the Monte Carlo test wrt the original test that allow determination
of the required simulation effort. Furthermore the concept of asymptotic
(resp. local asymptotic) relative Pitman efficiency (ARPE, resp. LARPE) is
adapted to Monte Carlo tests for the study of their asymptotic behaviour.
The normal limit case is investigated in more detail, leading to explicit
formulas for ARPE and LARPE.


1. Introduction. In many hypothesis testing problems, where the null dis- 
tribution of the desired test statistic is either unknown or too complicated to 
evaluate, Monte Carlo techniques are now widely used. Since simulated critical 
values are subject to sampling error, the use of Monte Carlo tables, such as the 
famous Lilliefors tables [Lilliefors (1967), (1969)], leads to tests that will not have 
the exact level of significance. The excess, which is unknown to the consumer of 
such tables, may be considerable, especially for Monte Carlo studies of only some 
hundred replicates. Furthermore for certain applications, for example permuta- 
tion tests and intuitive test statistics for the analysis of spatial patterns, a 
tabulation of Monte Carlo critical values is not feasible. These considerations give 
rise to the so-called Monte Carlo tests [independently proposed by Dwass (1957) 
and Barnard (1963)]: For each application of the desired test a simulation 
experiment of moderate size is carried out, delivering realizations of the test 
statistic under the null hypothesis. The Monte Carlo test then decides by 
comparing the simulated values with the observed value of the test statistic. 

For a continuous null distribution it has been shown by Hope (1968) and 
Birnbaum (1974) that an appropriate version of the Monte Carlo test procedure is 
of exact size a. Dwass (1957) obtained the corresponding result for the case of the 
Pitman two sample permutation test (which has a discrete null distribution 
conditional on the order statistics) by letting the sample size of the permutation 
test tend to infinity. Besides the papers mentioned above, there are others on this
subject by Foutz [(1980), (1981)], Jöckel (1981), and Marriott (1979). But there
remain open problems:


1. The case of an arbitrary null distribution has not previously been treated.
2. Power considerations are based on rather strict assumptions.
3. A general asymptotic efficiency concept does not exist.
4. The simulation sample size is still open to question.


In Section 2 we give a unified approach for arbitrary null distribution of the 
test statistic. In Sections 3 and 4 we discuss the performance of the Monte Carlo 
test with respect to power and asymptotic efficiency. The results may be used as 
recommendations for the simulated sample size. 

It should be noted that Monte Carlo tests have already become a part of
applied statistics; for examples see Besag and Diggle (1977), Green (1977),
Hollander (1971), Ripley (1977), and Tsutakawa and Yang (1974).


2. Definition of the Monte Carlo test procedure, performance under the
null hypothesis. Let $\{P_\theta: \theta \in \Theta\}$ be a family of probability measures on some
sample space $(\mathcal{X}, \mathcal{Y})$ (normally a Euclidean space with the Borel sets). We want to
test

$\Theta_0$ versus $\Theta_1 = \Theta - \Theta_0$,

having observed $x \in \mathcal{X}$. Consider an appropriate test statistic $T$ with values in
the reals. The following assumption often will be fulfilled:
There exists $\theta^* \in \Theta_0$, such that

(2.1) $P_{\theta_0}(T \le t) \le P_{\theta^*}(T \le t) \le P_{\theta_1}(T \le t)$ for all $(\theta_0, \theta_1) \in \Theta_0 \times \Theta_1$.

Furthermore we adopt

(2.2) $F_{\theta^*}(t) = P_{\theta^*}(T \le t)$ is a continuous distribution function.

Then

$\varphi_\alpha(x) = 1_{(-\infty,\ {}^{-1}F_{\theta^*}(\alpha)]}(T(x)),$

where

${}^{-1}F_{\theta^*}(\alpha) = \sup\{\xi: F_{\theta^*}(\xi) \le \alpha\},$

is of size $\alpha$ and according to (2.1) unbiased. Although in many practical situations
neither $F_{\theta^*}(T(x))$ nor ${}^{-1}F_{\theta^*}(\alpha)$ is manageable, it is often possible to simulate
random elements $x_1, \ldots, x_m$ distributed according to $P_{\theta^*}$. If the testing level $\alpha$ is
an integer multiple $k$ of $1/(m + 1)$, the Monte Carlo test (of simulation size $m$),
$\varphi_m$, rejects $\Theta_0$ if the observed value of the test statistic $t = T(x)$ is less than or
equal to the $k$th order statistic $t_{(k)}$ of the simulated values $t_i = T(x_i)$.
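
The decision rule just described can be sketched in a few lines. The sketch is an illustration only: the names `mc_reject` and `estimated_size` are hypothetical, and the null distribution of $T$ is assumed to be uniform on (0, 1) purely to have something to simulate from.

```python
import random

def mc_reject(t_obs, simulated, k):
    """Return True if t_obs <= k-th smallest simulated value (reject the null)."""
    return t_obs <= sorted(simulated)[k - 1]

def estimated_size(m, k, replications, seed=0):
    """Fraction of rejections when T is uniform on (0, 1) under the null (assumed)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        t_obs = rng.random()
        simulated = [rng.random() for _ in range(m)]
        rejections += mc_reject(t_obs, simulated, k)
    return rejections / replications

# Nominal level k / (m + 1) = 1 / 20 for m = 19 simulations and k = 1.
size = estimated_size(m=19, k=1, replications=20000)
```

For a continuous null distribution the estimated rejection frequency should be close to the nominal level $k/(m+1)$, whatever the number of simulations $m$.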

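For convenience $\varphi_m$ will be called the MC test (wrt $T$) or the MC test
corresponding to $\varphi_\alpha$. It should be observed that $\varphi_m$ is a randomized test
depending on an independent simulation experiment. Because of (2.2), $\varphi_m$ may be
equivalently rewritten as


$\varphi_m(t) = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} 1_{(-\infty, t)}(t_i) \le k - 1, \\ 0 & \text{if } \sum_{i=1}^{m} 1_{(-\infty, t)}(t_i) > k - 1. \end{cases}$


Denote

$F_{\theta^*}^{-1}(\xi) = \inf\{x: F_{\theta^*}(x) = \xi\}$

and let, for $\alpha \in (0; 1)$, $b(\alpha, m, \xi)$ be the density of a beta distribution on the unit
interval with parameters $p = (m + 1)\alpha$ and $q = (m + 1)(1 - \alpha)$; then by some
calculus we have the following identities:


$E_\theta \varphi_m = \int \sum_{j=0}^{k-1} \binom{m}{j} [F_{\theta^*}(t)]^j [1 - F_{\theta^*}(t)]^{m-j} \, dF_\theta(t)$

$\qquad = \int_{-\infty}^{\infty} \int_{F_{\theta^*}(t)}^{1} b(\alpha, m, \xi) \, d\xi \, dF_\theta(t) = \int_0^1 F_\theta(F_{\theta^*}^{-1}(\xi)) \, b(\alpha, m, \xi) \, d\xi.$


Similar results have been obtained (under more restrictive conditions on the
testing problem) by Hope (1968) and Birnbaum (1974). Obviously


$\int_0^1 F_\theta(F_{\theta^*}^{-1}(\xi)) \, b(\alpha, m, \xi) \, d\xi = \int_0^1 F_\theta({}^{-1}F_{\theta^*}(\xi)) \, b(\alpha, m, \xi) \, d\xi.$

For a fixed alternative parameter, $\theta_1 \in \Theta_1$, say,


$\beta_{\theta_1}(\alpha) = F_{\theta_1}({}^{-1}F_{\theta^*}(\alpha))$


is the power of the tests $\varphi_\alpha$ for $\{\theta^*\}$ versus $\{\theta_1\}$, considered as a function of $\alpha$.
The corresponding quantity for the MC test may thus be calculated as


$E_{\theta_1} \varphi_m = \int_0^1 \beta_{\theta_1}(\xi) \, b(\alpha, m, \xi) \, d\xi$


and furthermore we have

$E_{\theta_0} \varphi_m \le E_{\theta^*} \varphi_m = \alpha \le E_{\theta_1} \varphi_m$ for all $(\theta_0, \theta_1) \in \Theta_0 \times \Theta_1$.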


If the alternative parameter is fixed, as we shall assume in the sequel, the
subscript $\theta_1$ will be suppressed. Although in our context the function


$\alpha \mapsto \int_0^1 \beta(\xi) \, b(\alpha, m, \xi) \, d\xi = \beta_m(\alpha)$


makes sense only for values $\alpha \in \{1/(m + 1), \ldots, m/(m + 1)\}$, it is convenient to
extend the range and regard $\beta_m(\alpha)$ as the power of the MC test as a function of
the level $\alpha \in (0; 1)$.

The remainder of this section is devoted to the case of an arbitrary null
distribution $F_{\theta^*}$, apparently an open problem. So for the rest of this section we
drop (2.2). In this case the appropriate test $\varphi_\alpha$ of size $\alpha$ is

(2.3) $\varphi_\alpha(x) = \begin{cases} 1 & \text{if } T(x) < {}^{-1}F_{\theta^*}(\alpha), \\ \gamma(\alpha) & \text{if } T(x) = {}^{-1}F_{\theta^*}(\alpha), \\ 0 & \text{if } T(x) > {}^{-1}F_{\theta^*}(\alpha), \end{cases}$

where the randomizing constant $\gamma(\alpha)$ is chosen such that $E_{\theta^*} \varphi_\alpha = \alpha$. Let us
consider the "test statistic"


$\tilde T(t, U) = F_{\theta^*}(t) - U \cdot P_{\theta^*}(T = t),$

where $U$ denotes an independent random variable uniformly distributed on $(0; 1)$.
$\tilde T$ is not a test statistic in the strict sense, but may be regarded as some
randomized kind of test statistic, and the test $\tilde\varphi_\alpha$ based on $\tilde T$ is completely
equivalent to (2.3), viz.


$E_\theta \tilde\varphi_\alpha = E_\theta \varphi_\alpha$ for all $\theta \in \Theta$,

as may easily be seen.


Consequently the distribution of $\tilde T$ under $\theta^*$ is uniform on the unit interval.
Unfortunately the calculation of the corresponding MC test


(2.4) $\tilde\varphi_m(t) = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} 1_{(-\infty, \tilde T(t, u))}(\tilde T(t_i, u_i)) \le k - 1, \\ 0 & \text{if } \sum_{i=1}^{m} 1_{(-\infty, \tilde T(t, u))}(\tilde T(t_i, u_i)) > k - 1 \end{cases}$


still involves the unknown quantities $F_{\theta^*}(t)$, $F_{\theta^*}(t_i)$. (The $u$ and $u_i$'s, respectively,
are realizations of independent uniformly distributed random variables.)
This problem is solved by


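PROPOSITION 2.1. The MC test corresponding to $\tilde\varphi_\alpha$ may be written as
($k/(m + 1) = \alpha$)


(2.5) $\psi_m(t) = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} 1_{(-\infty, t]}(t_i) \le k - 1, \\ 0 & \text{if } \sum_{i=1}^{m} 1_{(-\infty, t)}(t_i) \ge k, \\ \gamma & \text{otherwise,} \end{cases}$

where


$\gamma = \left[ k - \sum_{i=1}^{m} 1_{(-\infty, t)}(t_i) \right] \left[ \left( \sum_{i=1}^{m} 1_{\{t\}}(t_i) \right) + 1 \right]^{-1}.$


This test is of exact size $\alpha$. The power of $\psi_m$ may be calculated as

(2.6) $\beta_m(\alpha) = \int_0^1 \beta(\xi) \, b(\alpha, m, \xi) \, d\xi,$

where $\beta(\alpha)$ denotes the power of $\varphi_\alpha$.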

PROOF. One first observes that the tests $\tilde\varphi_m$ and $\psi_m$ ((2.4) and (2.5),
respectively) are equivalent: Given $t, t_1, t_2, \ldots, t_m$ both tests reject the hypothesis with
the same probability, as may be easily verified. Since $\tilde\varphi_m$ is the MC test wrt $\tilde T$
(possessing a continuous null distribution), an appeal to the results obtained for
the continuous case completes the proof. □
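
A small sketch of the tie-aware rule (2.5) may help. It is an illustration under stated assumptions: the function names are hypothetical, exact rational arithmetic is used only for clarity, and the discrete null distribution (T equal to 0 or 1 with probability 1/2 each) is an assumed toy example used to check the exact-size claim by enumeration.

```python
from fractions import Fraction
from itertools import product

def mc_rejection_probability(t_obs, simulated, k):
    """Rejection probability of the test (2.5) given the observed and simulated values."""
    below = sum(1 for t in simulated if t < t_obs)   # simulated values < t
    ties = sum(1 for t in simulated if t == t_obs)   # simulated values == t
    if below + ties <= k - 1:
        return Fraction(1)
    if below >= k:
        return Fraction(0)
    # Randomization constant gamma = (k - below) / (ties + 1).
    return Fraction(k - below, ties + 1)

def exact_size_two_point(m, k):
    """Size under an assumed null with P(T = 0) = P(T = 1) = 1/2, by enumeration."""
    outcomes = list(product((0, 1), repeat=m + 1))
    total = sum(mc_rejection_probability(o[0], o[1:], k) for o in outcomes)
    return total / len(outcomes)

gamma_example = mc_rejection_probability(2, [1, 2, 2, 3], k=2)
size_example = exact_size_two_point(m=3, k=1)
```

In the first example the observed value is tied with two simulated values that occupy ranks 2 to 4 together with it, so the test rejects with probability 1/3; in the second, the enumerated size equals the nominal level $k/(m+1) = 1/4$.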


3. Power considerations for MC tests. This section is devoted to the
study of the power $\beta_m(\alpha)$ of the MC test under a fixed alternative parameter $\theta_1$
(if not stated otherwise). We make use of the terminology of the preceding section
and additionally assume $\lim_{\alpha \to 0} \beta(\alpha) = 0$ and $\lim_{\alpha \to 1} \beta(\alpha) = 1$. An essential
assumption for $\beta(\cdot)$ is its concavity. This assumption is satisfied if the $\varphi_\alpha$'s are
most powerful size-$\alpha$ tests based on $T$, a condition that will often be met. For a
more detailed discussion on this point we refer to Jöckel (1982). The next
proposition shows that the concavity property carries over to the corresponding
MC test.


PROPOSITION 3.1. If $\beta(\alpha)$ is a concave function of $\alpha$ then so is $\beta_m(\alpha)$.


PROOF. Since the densities $b(\alpha, m, \xi)$ constitute a one-parameter exponential
family in $\alpha$, they are by a result of Karlin (1968, page 18) strictly totally positive
of order $\infty$. Since


$\int_0^1 b(\alpha, m, \xi) \, d\xi = 1, \qquad \int_0^1 \xi \, b(\alpha, m, \xi) \, d\xi = \alpha,$


we may apply Proposition 3 in Karlin (1968, page 23) and conclude that the
mapping


$\alpha \mapsto \int_0^1 \beta(\xi) \, b(\alpha, m, \xi) \, d\xi$


is concave. □

REMARK. It may be shown that $\beta_m(\cdot)$ is strictly concave unless $\beta(\cdot)$ is linear.


From a practical point of view the question arises whether an increased 
simulation sample size m yields an increased power. This question has already 
been treated by Hope (1968) for a special case, but the general case has remained 
unsolved. We give 


THEOREM 3.2. If $\beta(\cdot)$ is concave, then the power of the corresponding MC
test $\beta_m(\alpha)$ is a monotone increasing function of simulated sample size for all $\alpha$.
Furthermore $\beta_m(\alpha) \uparrow \beta(\alpha)$ uniformly in $\alpha \in [0, 1]$ as $m \to \infty$.
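
The monotonicity asserted in Theorem 3.2 can be checked numerically. The sketch below is an illustration under stated assumptions: the power function $\beta(\alpha) = \Phi(\Phi^{-1}(\alpha) + 2)$ of a one-sided normal shift test is a hypothetical choice, the names `beta_density`, `mc_power`, and `shift_power` are not from the paper, and the integral (2.6) is approximated by a plain midpoint rule.

```python
import math
from statistics import NormalDist

NORMAL = NormalDist()

def beta_density(xi, p, q):
    """Density of the beta distribution with parameters p and q at xi in (0, 1)."""
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return math.exp(log_norm + (p - 1) * math.log(xi) + (q - 1) * math.log(1 - xi))

def mc_power(power, alpha, m, grid=20000):
    """Midpoint-rule approximation of (2.6): integral of power(xi) * b(alpha, m, xi)."""
    p, q = (m + 1) * alpha, (m + 1) * (1 - alpha)
    h = 1.0 / grid
    points = ((i + 0.5) * h for i in range(grid))
    return h * sum(power(xi) * beta_density(xi, p, q) for xi in points)

def shift_power(a):
    """Assumed concave power function: normal shift alternative with shift 2."""
    return NORMAL.cdf(NORMAL.inv_cdf(a) + 2.0)

alpha = 0.05
powers = [mc_power(shift_power, alpha, m) for m in (19, 39, 99, 199)]
size_check = mc_power(lambda xi: xi, alpha, 19)  # power(xi) = xi gives back alpha
exact_power = shift_power(alpha)
```

Under these assumptions the list `powers` increases with $m$ and stays below `exact_power`, while `size_check` reproduces the level $\alpha$, since the beta distribution in (2.6) has mean $\alpha$.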


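PROOF. Denoting

$B_m = \dfrac{\Gamma((m + 1)\alpha)\,\Gamma((m + 1)(1 - \alpha))}{\Gamma(m + 1)}$

we have by (2.6)


$\beta_{m+1}(\alpha) - \beta_m(\alpha) = \int_0^1 \left( \xi^{\alpha} (1 - \xi)^{1 - \alpha} \dfrac{B_m}{B_{m+1}} - 1 \right) b(\alpha, m, \xi) \, \beta(\xi) \, d\xi,$

the first factor of the integrand being

$> 0$ iff $\xi \in (z_1(\alpha), z_2(\alpha))$,
$= 0$ iff $\xi = z_1(\alpha)$ or $z_2(\alpha)$,
$< 0$ iff $\xi \notin [z_1(\alpha), z_2(\alpha)]$,

with some suitable functions $z_i(\cdot)$, determined such that

$(z_i(\alpha))^{\alpha} (1 - z_i(\alpha))^{1 - \alpha} = B_{m+1}/B_m$, $i = 1, 2$, and $0 < z_1(\alpha) < z_2(\alpha) < 1$.

Considering

$L(\xi) = \beta(z_1(\alpha)) + (\xi - z_1(\alpha)) \, \dfrac{\beta(z_2(\alpha)) - \beta(z_1(\alpha))}{z_2(\alpha) - z_1(\alpha)}$

with

$L(\xi) \le \beta(\xi)$ for $\xi \in [z_1(\alpha), z_2(\alpha)]$,
$L(\xi) \ge \beta(\xi)$ for $\xi \notin [z_1(\alpha), z_2(\alpha)]$,

we obtain


$\int_0^1 \beta(\xi) (b(\alpha, m + 1, \xi) - b(\alpha, m, \xi)) \, d\xi \ge \int_0^1 L(\xi) (b(\alpha, m + 1, \xi) - b(\alpha, m, \xi)) \, d\xi = 0,$


which had to be shown. The uniformity is, as in the following corollary, a simple
consequence of Dini's theorem. □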


REMARK. If $\beta(\cdot)$ is continuous at $\alpha$, then by $L^2$-convergence of the beta
distribution $b(\alpha, m, \cdot)$ to $\alpha$ we have $\beta_m(\alpha) \to \beta(\alpha)$ as $m \to \infty$. This has already been
shown by Birnbaum (1974) and Hope (1968) under more restrictive conditions.


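COROLLARY 3.3. Let $\Theta = \Theta_0 \cup \Theta_1$ be a topological space. Consider for the
testing problem $\Theta_0$ versus $\Theta_1$ the level-$\alpha$ tests $\varphi_\alpha$ and assume that the power
$\beta_{\theta_1}(\alpha) = \int \varphi_\alpha \, dF_{\theta_1}$ is a concave function in $\alpha$, and that for fixed $\alpha \in (0; 1)$


$\theta_1 \mapsto \beta_{\theta_1}(\alpha)$ is continuous.

Then


$\int_0^1 \beta_{\theta_1}(\xi) \, b(\alpha, m, \xi) \, d\xi \to \beta_{\theta_1}(\alpha)$ as $m \to \infty$

uniformly on every compact subset of $\Theta_1$.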


From the results obtained so far it should be clear that, with respect to power, 
an increase in simulation sample size is always desirable. Since, however, the 
increase of simulation effort considerably increases computer costs, there remain 
two important questions: 


(3.1) How much power is lost by using MC tests; more precisely,
what may we learn about the minimal value of $\beta_m(\alpha)/\beta(\alpha)$?

(3.2) How many simulations are needed so that the ratio $\beta_m(\alpha)/\beta(\alpha)$
exceeds a given constant?


The first problem has already been tackled by Dwass (1957) in the special case of
the Pitman two sample permutation test. His result is obtained under more
restrictive conditions and only in the limit as the number of observations of the
permutation test tends to infinity.


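THEOREM 3.4. Let $\alpha \in (0; 1)$ be fixed, let $\gamma \in [\beta(\alpha), 1]$ and $\bar\beta(\xi)$ be the
power of the test defined by


(3.3) $\bar\varphi_\xi = \begin{cases} \dfrac{\xi}{\alpha} \varphi_\alpha & \text{if } \xi \le \alpha, \\[1ex] \dfrac{1 - \xi}{1 - \alpha} \varphi_\alpha + \dfrac{\xi - \alpha}{1 - \alpha} \gamma & \text{if } \xi > \alpha. \end{cases}$


If for all $\xi \in [0; 1]$

(3.4) $\bar\beta(\xi) \le \beta(\xi)$

holds, then we have

(3.5) $\dfrac{\beta_m(\alpha)}{\beta(\alpha)} \ge 1 - \dfrac{1 - \alpha\gamma/\beta(\alpha)}{2\alpha(1 - \alpha)} \, E|Z_{m,\alpha} - \alpha|,$


where $Z_{m,\alpha}$ has a beta distribution with parameters $p = \alpha(m + 1)$ and $q =
(1 - \alpha)(m + 1)$. Furthermore we note that


(3.6) $E|Z_{m,\alpha} - \alpha| = \dfrac{2}{m + 1} \left( \alpha^{\alpha} (1 - \alpha)^{1 - \alpha} \right)^{m+1} \dfrac{\Gamma(m + 1)}{\Gamma((m + 1)\alpha)\,\Gamma((m + 1)(1 - \alpha))}$

and (for integer values of $\alpha(m + 1)$)


(3.7) $c_m \le \sqrt{\dfrac{\pi(m + 1)}{2\alpha(1 - \alpha)}} \; E|Z_{m,\alpha} - \alpha| \le d_m < 1,$

where

(3.8) $c_m = \exp\left[ \dfrac{1}{12(m + 1) + 1} - \dfrac{1}{12\alpha(m + 1)} - \dfrac{1}{12(1 - \alpha)(m + 1)} \right],$

$\qquad d_m = \exp\left[ \dfrac{1}{12(m + 1)} - \dfrac{1}{12\alpha(m + 1) + 1} - \dfrac{1}{12(1 - \alpha)(m + 1) + 1} \right],$

and $c_m, d_m \to 1$ as $m \to \infty$.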

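REMARK. Note that $\bar\beta(0) = 0$, $\bar\beta(\alpha) = \beta(\alpha)$, and $\bar\beta(1) = \gamma$ hold and that $\bar\beta$ is
linear on $[0, \alpha]$ and $[\alpha, 1]$. Consequently (3.4) is satisfied if $\beta$ is concave.

PROOF. By (3.4) and (2.6) we have

$\beta(\alpha) - \beta_m(\alpha) = E(\beta(\alpha) - \beta(Z_{m,\alpha}))$

$\qquad \le E(\beta(\alpha) - \bar\beta(Z_{m,\alpha}))$

$\qquad = E\left( \dfrac{\beta(\alpha) - \alpha\gamma}{2\alpha(1 - \alpha)} |Z_{m,\alpha} - \alpha| + \dfrac{2\alpha\beta(\alpha) - \beta(\alpha) - \alpha\gamma}{2\alpha(1 - \alpha)} (Z_{m,\alpha} - \alpha) \right),$


which implies (3.5) in view of $E Z_{m,\alpha} = \alpha$. Furthermore, (3.6) is easily proven by
partial integration. Finally we have, by Stirling's formula,


(3.9) $\Gamma(m + 1) = (m + 1)!/(m + 1) \le \sqrt{2\pi} \, (m + 1)^{m + 1/2} e^{-(m+1)} e^{1/(12(m+1))}$

and

(3.10) $\Gamma((m + 1)\alpha) = (\alpha(m + 1))!/(\alpha(m + 1)) \ge \sqrt{2\pi} \, ((m + 1)\alpha)^{(m+1)\alpha - 1/2} e^{-(m+1)\alpha} e^{1/(12\alpha(m+1) + 1)}.$

Since, obviously,

$\dfrac{1}{12(m + 1)} - \dfrac{1}{12(m + 1)\alpha + 1} - \dfrac{1}{12(m + 1)(1 - \alpha) + 1} < 0$


holds, (3.6), (3.9), and (3.10) together yield the right-hand side of (3.7). Another
application of Stirling's formula gives us the other part of this inequality. □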


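Note that (3.5) with $\gamma = \beta(\alpha)$ and (3.6) yield


$\dfrac{\beta_m(\alpha)}{\beta(\alpha)} \ge 1 - \dfrac{E|Z_{m,\alpha} - \alpha|}{2\alpha} = 1 - \dfrac{\left( \alpha^{\alpha} (1 - \alpha)^{1 - \alpha} \right)^{m+1}}{(m + 1)\alpha \, B((m + 1)\alpha, (m + 1)(1 - \alpha))}$

and that

(3.11) $1 - \dfrac{E|Z_{m,\alpha} - \alpha|}{2\alpha} = \dfrac{1}{\alpha} \int_0^{\alpha} \int_u^1 b(\alpha, m, \xi) \, d\xi \, du.$


The quantity given by (3.11) will be called $e_{D,m}$, the Dwass efficiency (for
simulation sample size $m$), which gives us a lower bound for $\beta_m(\alpha)/\beta(\alpha)$ under
the conditions mentioned above. From the asymptotic approximation (3.7) together
with (3.8) we are able to tabulate the Dwass efficiency for different values
of $\alpha$ and $m$, presented in Table 1. They are in good accordance with values
reported by Dwass (1957).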

The asymptotic approximation may be used for a quick determination of the
simulation sample size $m$ to achieve Dwass efficiency of at least $e^*$, viz.


$m = \dfrac{1}{2\pi} \cdot \dfrac{1 - \alpha}{\alpha} \cdot \dfrac{1}{(1 - e^*)^2}.$


Practical consequences of these considerations and some more applied aspects are
discussed in Jöckel (1984).


TABLE 1

Tabulation of Dwass efficiency $e_{D,m}$ based on the approximation (3.7). For
the digits displayed the upper and lower bounds coincide.


   m      α = 0.01    0.02     0.025    0.05     0.10
   19                                   0.64     0.743
   39                          0.64     0.736    0.815
   59                                   0.782    0.848
   99       0.63      0.732             0.829    0.881
  199       0.73      0.807    0.827    0.878    0.916
  299       0.777     0.841             0.900    0.931
  399       0.806     0.862    0.876    0.913    0.940
  499       0.825     0.876             0.922    0.947
  599       0.840     0.887    0.899    0.929    0.951
  699       0.852     0.896             0.934    0.955
  799       0.861     0.902    0.912    0.939    0.958
  899       0.869     0.907             0.942    0.960
  999       0.876     0.912    0.921    0.945    0.962
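
The Dwass efficiency and the sample-size rule can be evaluated directly. The following sketch is an illustration, not the computation used for Table 1: it evaluates the exact expression (3.6) through log-gamma functions instead of the bounds (3.7), and the function names are hypothetical.

```python
import math

def mean_abs_deviation(m, alpha):
    """E|Z - alpha| for Z ~ Beta((m+1)alpha, (m+1)(1-alpha)), following (3.6)."""
    n = m + 1
    p, q = n * alpha, n * (1 - alpha)
    log_value = (
        math.log(2.0 / n)
        + p * math.log(alpha) + q * math.log(1 - alpha)
        + math.lgamma(n) - math.lgamma(p) - math.lgamma(q)
    )
    return math.exp(log_value)

def dwass_efficiency(m, alpha):
    """Lower bound 1 - E|Z - alpha| / (2 alpha) for the power ratio."""
    return 1.0 - mean_abs_deviation(m, alpha) / (2.0 * alpha)

def approximate_sample_size(alpha, efficiency):
    """Asymptotic rule m = (1 / 2 pi) * ((1 - alpha) / alpha) / (1 - efficiency)^2."""
    return (1 - alpha) / (2 * math.pi * alpha * (1 - efficiency) ** 2)

e_19_005 = dwass_efficiency(19, 0.05)
e_19_010 = dwass_efficiency(19, 0.10)
m_needed = approximate_sample_size(0.05, 0.90)
```

For $m = 19$ this gives efficiencies of roughly 0.64 at $\alpha = 0.05$ and 0.743 at $\alpha = 0.10$, in line with the first row of Table 1, and the rule suggests about 300 simulations for 90% efficiency at $\alpha = 0.05$.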


4. Pitman efficiency of MC tests. The last section has confirmed that the
power properties of MC tests can be satisfactory. In situations where the use of
MC tests is indicated, power calculations for the original tests are not normally
available. In these cases a widely used and accepted concept for the comparison
of tests is that of asymptotic relative Pitman efficiency (ARPE). The aim of this
section is to study the ARPE of MC tests.

Throughout this section we shall assume that we wish to test


$\theta = \theta_0$ versus $\theta \in \Theta - \{\theta_0\}$,


where the parameter space $\Theta$ is a subset of the reals, such that the connected
component of $\theta_0$, $C(\theta_0) \ne \{\theta_0\}$. Furthermore let $(\varphi_\alpha^n)$ be a sequence of consistent,
unbiased level-$\alpha$ tests, which are assumed to be asymptotically normal, viz.


(4.1) $E_{\theta_n} \varphi_\alpha^n \to \Phi(\Phi^{-1}(\alpha) + \delta\eta), \qquad \delta > 0$, as $n \to \infty$,


where $\Phi$ denotes the Gaussian distribution function and $\theta_n$ is any sequence of the
form


(4.2) $\theta_n = \theta_0 + \eta/\sqrt{n} + o(1/\sqrt{n}), \qquad \eta > 0.$


The parameter $\delta$ is called the slope.

The concept of ARPE of MC tests may be developed in a more general
framework, $\Theta$ being an arbitrary topological space and $(\varphi_\alpha^n)$ possessing different
limiting distributions. For a more detailed discussion on this point the reader is
referred to Jöckel (1982). Here, however, we will restrict ourselves to the most
important case (4.1), which we shall call the normal limit case. The derivation in
the general case follows essentially the same lines as indicated here.

If $\varphi_m^n$ denotes the MC test corresponding to $\varphi_\alpha^n$ (for simulation sample size $m$)
we have for any sequence fulfilling (4.2)


(4.3) $E_{\theta_n} \varphi_m^n \to G_{m,\alpha}(\delta\eta),$

where

$G_{m,\alpha}(x) = \int_0^1 \Phi(\Phi^{-1}(u) + x) \, b(\alpha, m, u) \, du.$


To calculate the ARPE of the MC tests wrt the original tests we may apply a
result due to Rothe (1981).

We first observe that the tests $(\varphi_m^n)$ satisfy the conditions A, B, and C of
Rothe's paper with functions $g(\theta) = (\theta - \theta_0)^2$ and $H(\eta) = G_{m,\alpha}(\delta \cdot \sqrt{\eta})$. Hence
for fixed $\alpha$ and $\beta$, $0 < \alpha < \beta < 1$, an appeal to Theorem 3 in Rothe (1981, page
666) yields that


$e_m(\alpha, \beta) = \left[ \dfrac{\Phi^{-1}(\beta) - \Phi^{-1}(\alpha)}{G_{m,\alpha}^{-1}(\beta)} \right]^2$


is the ARPE of the MC tests wrt the original sequence of tests. This quantity
depends on $\alpha$ and $\beta$, but turns out to be independent of the slope. To study the
limiting behaviour of $e_m(\alpha, \beta)$ ($m \to \infty$) one first observes that the limiting
function in (4.1), considered as a function in $\alpha$, fulfills the conditions of the power
function $\beta(\alpha)$ in Section 3. Furthermore it is concave. Thus by virtue of Corollary
3.3 and Theorem 3.2 it is easy to show that

$e_m(\alpha, \beta) \uparrow 1$ as $m \to \infty$.

This convergence is uniform in $\beta$ on every compact subset of $(\alpha, 1)$. If simulation
sample size $m$ is a nondecreasing function of $n$ and we let $m = m(n)$ and $n$
simultaneously tend to infinity then the MC test and the exact test are asymptotically
equivalent.

If we are considering two sequences of tests with slopes $\delta_1$ and $\delta_2$, respectively,
the ARPE of the corresponding MC tests is (in an obvious terminology)


$e_{MC1,MC2}(\alpha, \beta) = e_{MC1,1}(\alpha, \beta) \cdot e_{1,2} \cdot e_{2,MC2}(\alpha, \beta) = e_{1,2} = \delta_1^2/\delta_2^2.$

Thus in the normal limit case the ARPE of two sequences of MC tests is the
same as that of the corresponding original tests.
The problem of ARPE of MC tests in the normal limit case has already been
tackled in the literature. Hope (1968) conjectured that for fixed $\alpha > 0$ there exists
a constant $\gamma < 1$, with


$G_{m,\alpha}(\eta) = \Phi(\Phi^{-1}(\alpha) + \gamma \cdot \eta).$

This would imply that $e_m(\alpha, \beta) = \gamma^2$, which is obviously not the case, and thus
demonstrates why she did not succeed in developing a satisfactory efficiency
concept for MC tests.
In order to achieve an asymptotic efficiency measure independent of $\beta$ we use
the concept of local asymptotic Pitman efficiency, slightly modifying a proposal
by Hájek and Šidák (1977).


DEFINITION 4.1. If $e_{1,2}(\alpha, \beta)$ denotes the ARPE of two sequences of tests
and the limit


$e_{1,2}(\alpha) = \lim_{\beta \downarrow \alpha} e_{1,2}(\alpha, \beta)$

exists, then $e_{1,2}(\alpha)$ is called the local asymptotic relative Pitman efficiency
(LARPE).


TABLE 2
Local asymptotic relative Pitman efficiency of MC tests
for simulation sample size m in the normal limit case
(in percent).

   m      α = 0.01    0.025    0.05     0.1
   19                          81.0     86.8
   39                          89.8     93.0
   99                          95.6     97.1
  499       97.4      98.6     99.1     99.1
  999       98.7      99.4     99.5     [illegible]


By using l'Hospital's rule and interchanging differentiation and integration it
is easy to show that in the normal limit case the LARPE of the MC tests wrt the
original tests is


$e_m(\alpha) = \left[ \dfrac{\int_0^1 f(\Phi^{-1}(u)) \, b(\alpha, m, u) \, du}{f(\Phi^{-1}(\alpha))} \right]^2,$


where $f$ denotes the normal density. Furthermore $\lim_{m \to \infty} e_m(\alpha) = 1$. By means
of numerical integration and some integral transformations $e_m(\alpha)$ is calculated in
Table 2.
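
The LARPE of the MC tests in the normal limit case, the squared ratio of $\int_0^1 f(\Phi^{-1}(u)) \, b(\alpha, m, u) \, du$ to $f(\Phi^{-1}(\alpha))$, can be approximated with a short script. This is a rough sketch under stated assumptions: the function name `larpe` is hypothetical, and a plain midpoint rule stands in for the numerical integration and integral transformations mentioned in the text, so the last digits need not agree with Table 2.

```python
import math
from statistics import NormalDist

def larpe(m, alpha, grid=20000):
    """Midpoint-rule approximation of the LARPE of an MC test with m simulations."""
    norm = NormalDist()
    p, q = (m + 1) * alpha, (m + 1) * (1 - alpha)
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) * h
        # Beta density b(alpha, m, u), evaluated on the log scale.
        beta_pdf = math.exp(log_norm + (p - 1) * math.log(u) + (q - 1) * math.log(1 - u))
        total += norm.pdf(norm.inv_cdf(u)) * beta_pdf
    numerator = h * total
    return (numerator / norm.pdf(norm.inv_cdf(alpha))) ** 2

values = [larpe(m, 0.05) for m in (19, 39, 99)]
```

Because $u \mapsto f(\Phi^{-1}(u))$ is concave and the beta distribution has mean $\alpha$, the computed efficiencies stay below 1 and increase with $m$.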
MAXIMUM ASYMPTOTIC VARIANCES OF TRIMMED MEANS
UNDER ASYMMETRIC CONTAMINATION


By JOHN R. COLLINS
University of Calgary


We consider the following problem arising in robust estimation theory:
Find the maximum asymptotic variance of a trimmed mean used to estimate
an unknown location parameter when the error distribution is subject to
asymmetric contamination. The model for the error distribution is
$F = (1 - \varepsilon)F_0 + \varepsilon G$, where $F_0$ is a known distribution symmetric about 0, $\varepsilon$ is
a fixed proportion of contamination, and $G$ is an unknown and possibly
asymmetric distribution. We prove, under the assumption that $F_0$ has a
symmetric unimodal density function $f_0$, that the maximal asymptotic variance
is obtained when $G$ places mass 1 at either $+\infty$ or $-\infty$. The key idea of
the proof is first to maximize the asymptotic variance subject to the side
conditions $F(a) = \alpha$ and $F(b) = 1 - \alpha$ when $a$ and $b$ are given.


1. Introduction and summary. Let $X_1, \ldots, X_n$ be a random sample from a
distribution $F(x - \theta)$, where $\theta$ is an unknown parameter to be estimated. Let
$T_n = T_n[X_1, \ldots, X_n]$ denote the $\alpha$-trimmed mean as defined, e.g., on page 58 of
Huber (1981). Then under mild regularity conditions on $F$, $n^{1/2}[T_n - ET_n]$
converges in distribution to a normal distribution with mean 0 and variance
$V(F)$, where (ref. Andrews et al. (1972), pages 31 and 34):


(1) $V(F) = \dfrac{1}{(1 - 2\alpha)^2} \left[ \int_a^b (x - c(\alpha))^2 \, dF(x) + \alpha\left[ (a - c(\alpha))^2 + (b - c(\alpha))^2 \right] \right],$


where


(2) $c(\alpha) = \int_a^b x \, dF(x) + \alpha(a + b),$


and where $a = F^{-1}(\alpha)$ and $b = F^{-1}(1 - \alpha)$.

A problem arising in robust estimation theory is to evaluate the supremum of
$V(F)$ as $F$ varies over distributions of the form

(3) $F = (1 - \varepsilon)F_0 + \varepsilon G,$

where $F_0$ is a fixed known distribution symmetric about 0, $G$ is unknown, and $\varepsilon$ is
a fixed proportion of contamination. Here the constants $\varepsilon$ and $\alpha$ are required to
satisfy $0 < \varepsilon < \alpha < 1/2$ in order to avoid breakdown. It is well known that when
the unknown contaminating distribution $G$ is restricted to be symmetric about 0,
then $T_n$ is an unbiased estimator of $\theta$ and $V(F)$ is maximized by the symmetric
distribution $G$ which places mass $1/2$ at each of $+\infty$ and $-\infty$.

Consider now the situation when the unknown contaminating distribution $G$
in (3) is not required to be symmetric. Although $T_n$ is not an unbiased estimator
of $\theta$ under asymmetric contamination, the problem of maximizing $V(F)$ remains
of interest. (See Section 4.9 of Huber (1981) for further motivation for this
problem.) Huber (1981) considered the case where $F_0$ is a normal distribution and
made the "highly plausible" conjecture that $V(F)$ is maximized over all $F$ of
form (3) when $G$ places mass 1 at either $+\infty$ or $-\infty$.

In this paper we prove that Huber's conjecture is true whenever the fixed $F_0$
in (3) has a density $f_0$ which is symmetric about 0 and unimodal. The main
difficulty in proving the result is that the limits of integration in formula (1)
depend on $F$. The device used to circumvent this difficulty is to first maximize
$V(F)$ over all $F$ of form (3) subject to the side conditions $F(a) = \alpha$ and
$F(b) = 1 - \alpha$. A simple argument using the method of moment spaces yields the
maximum asymptotic variance, $V(a, b)$, subject to $F(a) = \alpha$ and $F(b) = 1 - \alpha$.
One then shows that $V(a, b)$ is maximized over all possible pairs $(a, b)$ by the
choice of $a$ and $b$ obtained by placing all the contaminating mass at $+\infty$.


2. Maximizing the asymptotic variance. Assume that $\varepsilon$ and $\alpha$ are fixed,
with $0 < \varepsilon < \alpha < 1/2$. Let $F_0$ be a fixed distribution function with a density
function $f_0 = F_0'$ satisfying the following two assumptions:

ASSUMPTION 1. $f_0$ is symmetric about 0, i.e., $f_0(x) = f_0(-x)$ a.e. $x$.


ASSUMPTION 2. $f_0(x)$ is strictly decreasing in $x > 0$.


The problem is to maximize $V(F)$, given by (1), over all $F$ of form (3).
Simplification of (1) yields


(4) $V(F) = V_1(F)/(1 - 2\alpha)^2,$

where

(5) $V_1 = V_1(F) = -[c(\alpha)]^2 + \int_a^b x^2 \, dF(x) + \alpha(a^2 + b^2).$


Here $c(\alpha)$ is given by (2), and $a$ and $b$ satisfy

(6) $F(a) = \alpha, \qquad F(b) = 1 - \alpha.$


Our first step will be to maximize $V_1$ subject to $a$ and $b$ being given. That is,
we will maximize $V_1(F)$ over the convex subclass of distributions of form
$F(x) = (1 - \varepsilon)F_0(x) + \varepsilon G(x)$ which satisfy (6). Only pairs of values of $a$ and $b$
for which this convex subclass is nonempty will be considered. For given values of
$a$ and $b$, $G(a)$ and $G(b)$ are determined by

(7) $G(a) = (\alpha - (1 - \varepsilon)F_0(a))/\varepsilon, \qquad 1 - G(b) = (\alpha - (1 - \varepsilon)(1 - F_0(b)))/\varepsilon,$

so that the $G$-mass $G(b) - G(a)$ is also known.

We may assume that $a$ and $b$ satisfy

(8) $0 \le |a| \le b,$

so that $1 - F_0(b) \le F_0(a)$ and $1 - G(b) \ge G(a)$. The reason that there is no loss
of generality in assuming that (8) holds is that if $F = (1 - \varepsilon)F_0 + \varepsilon G$ and if
$F^* = (1 - \varepsilon)F_0 + \varepsilon G^*$, where $G^*(x) = 1 - G(-x)$ for all $x$, then clearly $V(F) =
V(F^*)$ by symmetry.
For fixed $a$ and $b$, it follows from (2), (3), and (5) that


(9) $V_1 = -\left[ C_1 + \varepsilon \int_a^b x \, dG(x) \right]^2 + C_2 + \varepsilon \int_a^b x^2 \, dG(x),$


where $C_1$ and $C_2$ are positive constants, so that $V_1$ is a simple quadratic function
of the moments


(10) $u = \int_a^b x \, dG(x), \qquad v = \int_a^b x^2 \, dG(x).$


It is well known that the pair $(u/\rho, v/\rho)$ can be any point in the convex hull of
the curve $S = \{(x, x^2): a \le x \le b\}$, where $\rho = G(b) - G(a)$ is fixed by (7). The
upper boundary of the convex hull of $S$ is a straight line segment, and each point
on that line segment can be realized by a distribution whose restriction to the
interval $[a, b]$ is supported by the pair of points $\{a, b\}$. Keeping $u$ fixed and first
maximizing $V_1$ relative to $v$, it is obvious from (9) that any distribution $G$
maximizing $V_1$ must have $v$ maximal, that is, must be supported by $\{a, b\}$.

For fixed $a$ and $b$, let the maximizing $G$ have masses $p$ and $q$ at $a$ and $b$,
respectively, so that $p + q = \rho$. Then we have $u = pa + qb = \rho a + q(b - a)$
and $v = pa^2 + qb^2 = \rho a^2 + q(b^2 - a^2)$. So the maximum value of $V_1$, subject to
(6), is the maximum value of


(11) $V_1 = V_1(q) = -\left\{ (1 - \varepsilon) \int_a^b x \, dF_0(x) + \varepsilon\left[ (G(b) - G(a))a + q(b - a) \right] + \alpha(a + b) \right\}^2$

$\qquad\qquad + (1 - \varepsilon) \int_a^b x^2 \, dF_0(x) + \varepsilon\left[ (G(b) - G(a))a^2 + q(b^2 - a^2) \right] + \alpha(a^2 + b^2),$

where $q$ ranges over $[0, G(b) - G(a)]$, and where $G(a)$ and $G(b)$ are determined
by (7).

Partial differentiation of (11) relative to $q$ and substitution of identity (2) yields:

(12) $\partial V_1/\partial q = -2c(\alpha)\varepsilon(b - a) + \varepsilon(b^2 - a^2)$

$\qquad\qquad = 2\varepsilon(b - a)(1 - 2\alpha) \left[ \dfrac{b + a}{2} - \dfrac{1}{1 - 2\alpha} \int_a^b x \, dF(x) \right],$


where $F = (1 - \varepsilon)F_0 + \varepsilon[(\rho - q)\delta_a + q\delta_b]$. Since we also have $\partial^2 V_1/\partial q^2 =
-2\varepsilon^2(b - a)^2 < 0$, we need only inspect $\partial V_1/\partial q$ at 0 and $\rho$ to determine whether
$V_1$ is maximized when (i) $q = \rho$, (ii) $q \in (0, \rho)$, or (iii) $q = 0$. The following
lemma shows that, when $\rho > 0$, the possibility of $V_1$ being maximized when
$q = 0$ (corresponding to having all the contaminating mass at $a$) is ruled out
under our assumptions.


LEMMA. Let $F_0$ be a fixed distribution function satisfying Assumptions 1 and
2. Let $a$ and $b$ be fixed numbers which satisfy (8) and for which $\rho = G(b) - G(a)$
(defined by (7)) satisfies $\rho > 0$. Then $\partial V_1/\partial q$ is nonnegative at $q = 0$.


PROOF. Suppose not, i.e., suppose that $\partial V_1/\partial q < 0$ at $q = 0$. By (12), this
implies that when the restriction of the distribution to $[a, b]$ is $F = (1 - \varepsilon)F_0 +
\varepsilon\rho\delta_a$, the average of the distribution over $[a, b]$, namely $\int_a^b x \, dF(x)/(F(b) - F(a))$,
is $> (b + a)/2$. Then the average over $[a, b]$ under $F_0$, namely
$\int_a^b x \, dF_0(x)/(F_0(b) - F_0(a))$, must also be $> (b + a)/2$, since mixing $F_0$ with $\delta_a$
can only pull the average toward the left. Now let $x_0 = (b + a)/2$ and note that
$x_0 \ge 0$ by assumption (8). To complete the proof by contradiction, it remains to
show that the average value over $[a, b]$ under $F_0$ is $\le x_0$. But this follows from
the calculation


(13) $\int_a^b x \, dF_0(x) - x_0 \int_a^b dF_0(x) = \int_a^{x_0} (x - x_0) f_0(x) \, dx + \int_{x_0}^b (x - x_0) f_0(x) \, dx$

$\qquad\qquad = \int_0^{(b - a)/2} t \left[ f_0(x_0 + t) - f_0(x_0 - t) \right] dt \le 0,$


since Assumptions 1 and 2 imply that $f_0(x - t) \ge f_0(x + t)$ for all $t > 0$ and
$x \ge 0$. □


In view of the lemma, the maximum of $V_1$ subject to fixed $a$ and $b$ satisfying
(8) occurs when either (i) $q = \rho$ (all contaminating mass at $b$) or (ii) $q \in (0, \rho)$ (a
proper mixture of mass at both $a$ and $b$). We remark that calculations for the
case when $F_0$ is the standard normal distribution show that both cases (i) and (ii)
do in fact occur, depending on the values of $a$ and $b$.


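THEOREM. Under Assumptions 1 and 2 on $F_0$:
(i) The maximum value of $V_1(F) = (1 - 2\alpha)^2 V(F)$ over all $F = (1 - \varepsilon)F_0 + \varepsilon G$
is


(14) $-\left[ (1 - \varepsilon) \int_a^b x \, dF_0(x) + \alpha(a + b) \right]^2 + (1 - \varepsilon) \int_a^b x^2 \, dF_0(x) + \alpha(a^2 + b^2)$


when $a = a_0$ and $b = b_0$, where $a_0 = F_0^{-1}(\alpha/(1 - \varepsilon))$ and $b_0 = F_0^{-1}((1 - \alpha)/
(1 - \varepsilon))$.

(ii) The maximum is attained at $F = (1 - \varepsilon)F_0 + \varepsilon G$ if and only if either $G$
places mass 1 on $(b_0, \infty]$ or $G$ places mass 1 on $[-\infty, -b_0)$.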
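
For a concrete feel for (14), the maximal asymptotic variance can be evaluated when $F_0$ is the standard normal distribution. The sketch below is an illustration under that assumption: the function name is hypothetical, the closed forms for the truncated normal moments are standard, and the boundary case $\varepsilon = 0$ (no contamination) is allowed only as a baseline for comparison.

```python
from statistics import NormalDist

def max_asymptotic_variance(eps, alpha):
    """V = V_1 / (1 - 2 alpha)^2 with V_1 from (14), for standard normal F_0."""
    if not 0 <= eps < alpha < 0.5:
        raise ValueError("need 0 <= eps < alpha < 1/2")
    norm = NormalDist()
    a0 = norm.inv_cdf(alpha / (1 - eps))
    b0 = norm.inv_cdf((1 - alpha) / (1 - eps))
    # For the standard normal: int_a^b x dF0 = f0(a) - f0(b) and
    # int_a^b x^2 dF0 = F0(b) - F0(a) + a f0(a) - b f0(b).
    first = norm.pdf(a0) - norm.pdf(b0)
    second = norm.cdf(b0) - norm.cdf(a0) + a0 * norm.pdf(a0) - b0 * norm.pdf(b0)
    c = (1 - eps) * first + alpha * (a0 + b0)
    v1 = -c ** 2 + (1 - eps) * second + alpha * (a0 ** 2 + b0 ** 2)
    return v1 / (1 - 2 * alpha) ** 2

baseline = max_asymptotic_variance(0.0, 0.25)      # uncontaminated 25% trimmed mean
contaminated = max_asymptotic_variance(0.05, 0.25)
```

With $\varepsilon = 0$ the value reduces to the asymptotic variance of the 25% trimmed mean at the normal, about 1.195, and the maximal variance under 5% contamination is larger.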


PROOF. (i) Let $V_1(a, b)$ denote the maximal value of $V_1$ subject to $F(a) = \alpha$
and $F(b) = 1 - \alpha$. We need to show $V_1(a, b) \le V_1(a_0, b_0)$ for all possible $(a, b)$.
Without loss of generality, consider only pairs $(a, b)$ which satisfy (8). Our first
step is to show that for each fixed $a$, $V_1(a, b)$ is maximized at the maximal
possible value of $b$ (corresponding to $\rho = G(b) - G(a) = 0$); namely at $b = b(a)$
satisfying

(15) $F_0(b) = F_0(a) + (1 - 2\alpha)/(1 - \varepsilon).$

For fixed $a$, we will show that $\partial V_1(a, b)/\partial b > 0$ for all $b$. First let $S$ be any
interval of $b$'s for which $\partial V_1/\partial q$ (formula (12)) is $\ge 0$ at $q = \rho$. Equivalently, $S$
is an interval of $b$'s for which


(16) $\dfrac{b + a}{2} - \dfrac{1}{1 - 2\alpha} \int_a^b x \, dF_b(x) \ge 0,$

where $F_b$ denotes a distribution with restriction to $[a, b]$ given by $F_b =
(1 - \varepsilon)F_0 + \varepsilon\rho\delta_b$. In view of the lemma, $V_1(q)$ is maximized at $q = \rho$ for all
$b \in S$. For $b \in S$, $V_1(a, b)$ is obtained by substituting $q = \rho = G(b) - G(a)$
into the right-hand side of (11). Differentiating $V_1(a, b)$ with respect to $b$, noting
that $\partial(\varepsilon G(b))/\partial b = -(1 - \varepsilon)f_0(b)$ by (7), yields (after some simplification) that


(17) $\partial V_1(a, b)/\partial b = 2\left[ \varepsilon(G(b) - G(a)) + \alpha \right] \left\{ b - \left[ \int_a^b x \, dF_b(x) + \alpha(a + b) \right] \right\}$

$\qquad\qquad = 2\left[ \varepsilon(G(b) - G(a)) + \alpha \right] \left\{ \dfrac{b - a}{2} + (1 - 2\alpha) \left[ \dfrac{b + a}{2} - \dfrac{1}{1 - 2\alpha} \int_a^b x \, dF_b(x) \right] \right\} > 0$

for all $b \in S$, by (16).

Next, for fixed $a$, let $S_1$ be any open interval of $b$'s for which the value of $q$
maximizing $V_\alpha(q)$ in formula (11) satisfies $0 < q < p$. Then in view of the lemma,
it follows that for $b \in S_1$, the maximum value of $V_\alpha$ is

$$
V_\alpha(a, b) = -\Big[\int_a^b x\,dF_q(x) + \alpha(a + b)\Big]^2 + \int_a^b x^2\,dF_q(x) + \alpha(a^2 + b^2), \tag{18}
$$

where $F_q = (1 - \varepsilon)F_0 + \varepsilon(G(b) - G(a) - q)\delta_a + \varepsilon q\,\delta_b$ on $[a, b]$, and where $q =
q(b)$ is the unique solution in $(0, p)$ of $\partial V_\alpha(q)/\partial q = 0$. Equivalently, from (12),
$q = q(b)$ satisfies

$$
\frac{b + a}{2} - \frac{1}{1 - 2\alpha}\int_a^b x\,dF_q(x) = 0. \tag{19}
$$


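Differentiating (18) with respect to $b$ on $S_1$ yields

$$
\begin{aligned}
\frac{\partial V_\alpha(a, b)}{\partial b} = {} & -2\Big[\int_a^b x\,dF_q(x) + \alpha(a + b)\Big] \\
& \times \Big[(1 - \varepsilon)(b - a)f_0(b) + \varepsilon(b - a)\frac{dq}{db} + \varepsilon q + \alpha\Big] \\
& + (1 - \varepsilon)(b^2 - a^2)f_0(b) + \varepsilon(b^2 - a^2)\frac{dq}{db} + 2\varepsilon b q + 2\alpha b.
\end{aligned} \tag{20}
$$

Substitution of $(b + a)/2 = \int_a^b x\,dF_q(x) + \alpha(a + b)$ (which is just (19)) into (20)
yields, after simplification, that

$$
\frac{\partial V_\alpha(a, b)}{\partial b} = (b - a)(\alpha + \varepsilon q) > 0 \tag{21}
$$

for all $b \in S_1$.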

From (17) and (21) it follows that, for each fixed $a$, $V_\alpha(a, b)$ is maximized when
$b$ attains its maximum value, corresponding to $p = 0$. Thus $V_\alpha(a)$, defined as the
maximum value of $V_\alpha$ given $a$, is given by (14) with $b = b(a)$ determined by (15).
It remains to show that $V_\alpha(a)$ attains its maximum at the maximum possible
value of $a$, namely at $a = a_0 = F_0^{-1}(\alpha/(1 - \varepsilon))$.

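Computation of $\partial V_\alpha(a)/\partial a$, using from (15) that $\partial b/\partial a = f_0(a)/f_0(b)$, yields

$$
\begin{aligned}
\partial V_\alpha(a)/\partial a = {} & -2\Big[(1 - \varepsilon)\int_a^b x f_0(x)\,dx + \alpha(a + b)\Big] \\
& \times \big[(1 - \varepsilon)f_0(a)(b - a) + \alpha(1 + f_0(a)/f_0(b))\big] \\
& + (1 - \varepsilon)f_0(a)(b^2 - a^2) + 2\alpha[a + b f_0(a)/f_0(b)].
\end{aligned} \tag{22}
$$

So to show that $\partial V_\alpha(a)/\partial a \ge 0$ for all $a$, completing the proof of (i), it suffices to
show both:

$$
\frac{b + a}{2} \ge (1 - \varepsilon)\int_a^b x f_0(x)\,dx + \alpha(a + b) \tag{23}
$$

and

$$
\frac{a + [b f_0(a)/f_0(b)]}{1 + [f_0(a)/f_0(b)]} \ge (1 - \varepsilon)\int_a^b x f_0(x)\,dx + \alpha(a + b). \tag{24}
$$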
Using an inequality from the proof of the lemma and using the identity (15)
yields

$$
\frac{b + a}{2} \ge \frac{\int_a^b x\,dF_0(x)}{F_0(b) - F_0(a)} = \frac{1 - \varepsilon}{1 - 2\alpha}\int_a^b x\,dF_0(x), \tag{25}
$$

which is (23). Also (24) will follow from (23) if we can show that

$$
\frac{a f_0(b) + b f_0(a)}{f_0(a) + f_0(b)} \ge \frac{b + a}{2}. \tag{26}
$$

But we have

$$
\frac{a f_0(b) + b f_0(a)}{f_0(a) + f_0(b)} - \frac{b + a}{2} = \frac{(b - a)(f_0(a) - f_0(b))}{2(f_0(a) + f_0(b))} \ge 0, \tag{27}
$$

since $b - a > 0$ and since $b \ge |a|$ implies that $f_0(b) \le f_0(|a|) = f_0(a)$ by Assumptions
1 and 2. This completes the proof of (i).

(ii) When $b > |a|$, it is easily seen that the inequalities (25), (26), and (27) are
strict inequalities. Thus $\partial V_\alpha(a)/\partial a > 0$ except at the boundary case where
$|a| = b$, proving that the unique maximum value of $V_\alpha$ is given by (14) when
$a = a_0$ and $b = b_0$. Clearly $F = (1 - \varepsilon)F_0 + \varepsilon G$ attains the maximum subject to
(8) if and only if $G$ is concentrated on $(b_0, \infty]$. Removal of side condition (8)
completes the proof of (ii) by symmetry. □


REMARK 1. For numerical values of the maximal asymptotic variance corresponding
to various values of $\varepsilon$ and $\alpha$ when $F_0$ is the standard normal distribution,
see Exhibit 4.9.2 on page 105 of Huber (1981).
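As a rough illustration of how such values can be produced, the sketch below evaluates expression (14) at $a = a_0$, $b = b_0$ for the standard normal $F_0$ and divides by $(1 - 2\alpha)^2$ to obtain $V(F)$. It uses the closed forms $\int_a^b x\varphi(x)\,dx = \varphi(a) - \varphi(b)$ and $\int_a^b x^2\varphi(x)\,dx = \Phi(b) - \Phi(a) + a\varphi(a) - b\varphi(b)$. The function name and the restriction to $0 \le \varepsilon < \alpha < 1/2$ (so that $b_0$ is finite) are assumptions of this sketch, and no attempt is made here to reproduce the entries of Huber's exhibit.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal F_0, as in Remark 1


def max_trimmed_mean_variance(eps, alpha):
    """Evaluate (14) at (a_0, b_0) for normal F_0 and rescale to V(F)."""
    if not (0 <= eps < alpha < 0.5):
        # Assumption of this sketch: b_0 must be finite.
        raise ValueError("requires 0 <= eps < alpha < 1/2")
    a0 = PHI.inv_cdf(alpha / (1 - eps))
    b0 = PHI.inv_cdf((1 - alpha) / (1 - eps))
    # Closed-form partial moments of the standard normal on [a0, b0].
    first = PHI.pdf(a0) - PHI.pdf(b0)
    second = PHI.cdf(b0) - PHI.cdf(a0) + a0 * PHI.pdf(a0) - b0 * PHI.pdf(b0)
    center = (1 - eps) * first + alpha * (a0 + b0)
    v_alpha = -center ** 2 + (1 - eps) * second + alpha * (a0 ** 2 + b0 ** 2)
    return v_alpha / (1 - 2 * alpha) ** 2


for eps in (0.0, 0.05, 0.1):
    print(eps, round(max_trimmed_mean_variance(eps, 0.25), 4))
```

With $\varepsilon = 0$ the function reduces to the usual asymptotic variance of the trimmed mean at the normal model, which gives a simple sanity check.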


REMARK 2. Not all $F$ of the form $F = (1 - \varepsilon)F_0 + \varepsilon G$ satisfy the regularity
conditions under which the $\alpha$-trimmed mean is asymptotically normal with
variance $V(F)$. For such regularity conditions, see Bickel (1965) and Stigler
(1973). However, it is clear from part (ii) of our theorem that there are suitably
regular $F$'s which attain the maximal value of $V$.
particular the present proof of the lemma is due to the referee.
PSEUDO MAXIMUM LIKELIHOOD ESTIMATION:
THE ASYMPTOTIC DISTRIBUTION 


By WILLIAM R. PARKE 


University of California, Santa Barbara 


Gong and Samaniego (1981) define pseudo maximum likelihood estimation
and derive the asymptotic distribution of the resulting estimates. This
note gives a simpler and more elegant expression for the asymptotic variance
of a pseudo maximum likelihood estimate.


1. Introduction. In the presence of computational and algebraic obstacles to
computing maximum likelihood estimates for a given model, the following pseudo
maximum likelihood estimation procedure may be of some value. Let the likelihood
function $L_n(\theta, \pi)$ for a sample of size $n$ be defined over two parameter
vectors, $\theta$ and $\pi$. Suppose that maximizing $L_n(\theta, \pi)$ over both $\theta$ and $\pi$ is
difficult, but that some alternative procedure (other than maximizing $L_n(\theta, \pi)$)
yields an estimate $\tilde\pi_n$. A pseudo MLE $\hat\theta_n(\tilde\pi_n)$ can then be computed by maximizing
$L_n(\theta, \tilde\pi_n)$ over $\theta$. Gong and Samaniego (1981) discuss this general procedure
and derive the asymptotic distribution of the pseudo MLE $\hat\theta_n(\tilde\pi_n)$.

This note improves upon the expression that Gong and Samaniego give for the
asymptotic variance-covariance matrix of the pseudo MLE $\hat\theta_n(\tilde\pi_n)$. Their expression
contains a term that is shown here to equal zero for all pseudo maximum
likelihood estimation problems satisfying the standard regularity conditions that
they specified. While this point could be established for each particular application
(as Gong and Samaniego do for their signal plus noise application), the
present proof covers the general case.

The possible practical advantages of pseudo maximum likelihood estimation 
are realized for a diverse range of estimation problems. Gong and Samaniego use 
the technique to simplify the calculations for a signal plus noise problem. Other 
applications include estimating a linear reduction of an otherwise nonlinear least 
squares estimation problem [Durbin (1960), Wallis (1967)], pooling time series 
and cross section data [Maddala (1971)], and avoiding likelihood function singu- 
larities in estimating large economic forecasting models [Fair and Parke (1980) 
and Parke (1985)]. 


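2. Asymptotic theory. The information matrix $\mathcal{I}$ for the vector $(\theta, \pi)$ can
be partitioned as

$$
\mathcal{I} = \begin{pmatrix} \mathcal{I}_{11} & \mathcal{I}_{12} \\ \mathcal{I}_{21} & \mathcal{I}_{22} \end{pmatrix}.
$$

Let $\ell_n(\theta, \pi)$ denote $\log(L_n(\theta, \pi))$, let $\dot\ell_n(\theta_0, \pi_0)$ denote $n^{-1}\,\partial\ell_n(\theta, \pi)/\partial\theta$
evaluated at $(\theta_0, \pi_0)$, and let the matrix $\Sigma$ be defined by

$$
\sqrt{n}\begin{pmatrix} \dot\ell_n(\theta_0, \pi_0) \\ \tilde\pi_n - \pi_0 \end{pmatrix} \to N\left(0, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right).
$$

Gong and Samaniego show that, for the case of scalars $\theta$ and $\pi$,

$$
\sqrt{n}\,(\hat\theta_n(\tilde\pi_n) - \theta_0) \to N(0, \sigma^2),
$$

where

$$
\sigma^2 = 1/\mathcal{I}_{11} + (\mathcal{I}_{12}^2/\mathcal{I}_{11}^2)\Sigma_{22} - 2(\mathcal{I}_{12}/\mathcal{I}_{11}^2)\Sigma_{12}.
$$

In fact, $\Sigma_{12} = 0$, and

$$
\sqrt{n}\begin{pmatrix} \hat\theta_n(\tilde\pi_n) - \theta_0 \\ \tilde\pi_n - \pi_0 \end{pmatrix} \to N\left(0, \begin{pmatrix} \mathcal{I}_{11}^{-1} + \mathcal{I}_{11}^{-1}\mathcal{I}_{12}\Sigma_{22}\mathcal{I}_{21}\mathcal{I}_{11}^{-1} & -\mathcal{I}_{11}^{-1}\mathcal{I}_{12}\Sigma_{22} \\ -\Sigma_{22}\mathcal{I}_{21}\mathcal{I}_{11}^{-1} & \Sigma_{22} \end{pmatrix}\right). \tag{2.1}
$$

The proof of this result is given in Section 3.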

The simplicity of this result arises from the surprising fact that $\Sigma_{12}$ equals zero
for any consistent estimate $\tilde\pi_n$. In general, statistics computed using the same
data, as are $\dot\ell_n(\theta_0, \pi_0)$ and $\tilde\pi_n$, will be asymptotically correlated. A general result
that this correlation is zero eliminates the potentially difficult task of deriving an
expression for $\Sigma_{12}$ for a particular estimation problem.

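The asymptotic distributions of the pseudo MLE $\hat\theta_n(\tilde\pi_n)$ and the MLE $\hat\theta_n(\hat\pi_n)$
can be compared by inverting the information matrix $\mathcal{I}$ to obtain

$$
\sqrt{n}\begin{pmatrix} \hat\theta_n(\hat\pi_n) - \theta_0 \\ \hat\pi_n - \pi_0 \end{pmatrix} \to N\left(0, \begin{pmatrix} \mathcal{I}_{11}^{-1} + \mathcal{I}_{11}^{-1}\mathcal{I}_{12}\Sigma^*_{22}\mathcal{I}_{21}\mathcal{I}_{11}^{-1} & -\mathcal{I}_{11}^{-1}\mathcal{I}_{12}\Sigma^*_{22} \\ -\Sigma^*_{22}\mathcal{I}_{21}\mathcal{I}_{11}^{-1} & \Sigma^*_{22} \end{pmatrix}\right), \tag{2.2}
$$

where $\Sigma^*_{22} = (\mathcal{I}_{22} - \mathcal{I}_{21}\mathcal{I}_{11}^{-1}\mathcal{I}_{12})^{-1}$ is the asymptotic variance of the MLE $\hat\pi_n$. The
expression (2.1) can in fact be obtained algebraically from (2.2) by simply
substituting the asymptotic variance $\Sigma_{22}$ of the alternative estimate $\tilde\pi_n$ for the
asymptotic variance of the MLE $\hat\pi_n$. In both (2.1) and (2.2), the term $\mathcal{I}_{11}^{-1}$ is the
asymptotic variance of the estimate $\hat\theta_n(\pi_0)$ that could be computed if $\pi_0$ were
known. The term $\mathcal{I}_{11}^{-1}\mathcal{I}_{12}\Sigma^*_{22}\mathcal{I}_{21}\mathcal{I}_{11}^{-1}$ in (2.2) is the minimum possible additional
asymptotic variance in an estimate of $\theta$ if $\pi$ is not known. This minimum is
attained for the MLE $\hat\theta_n(\hat\pi_n)$ computed jointly with the MLE $\hat\pi_n$. For the pseudo
MLE $\hat\theta_n(\tilde\pi_n)$, the corresponding term $\mathcal{I}_{11}^{-1}\mathcal{I}_{12}\Sigma_{22}\mathcal{I}_{21}\mathcal{I}_{11}^{-1}$ in (2.1) is the asymptotic
variance that can be attributed to using the estimate $\tilde\pi_n$ of $\pi_0$. The asymptotic
inefficiency of the pseudo MLE $\hat\theta_n(\tilde\pi_n)$ relative to the MLE $\hat\theta_n(\hat\pi_n)$ is thus given
by $\mathcal{I}_{11}^{-1}\mathcal{I}_{12}(\Sigma_{22} - \Sigma^*_{22})\mathcal{I}_{21}\mathcal{I}_{11}^{-1}$. The pseudo MLE $\hat\theta_n(\tilde\pi_n)$ is asymptotically efficient
if $\mathcal{I}_{12}$ equals zero (because the MLEs $\hat\theta_n$ and $\hat\pi_n$ are then asymptotically
uncorrelated) or if $\tilde\pi_n$ is asymptotically as efficient as the MLE $\hat\pi_n$.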
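For scalar $\theta$ and $\pi$ the comparison between (2.1) and (2.2) is plain arithmetic, which the following sketch spells out. The numerical values of the information matrix entries and of $\Sigma_{22}$ are hypothetical, chosen only to make the example concrete, and the function names are illustrative.

```python
def nuisance_mle_variance(i11, i12, i22):
    """Sigma*_22 = (I22 - I21 * I11^{-1} * I12)^{-1} for scalar parameters."""
    return 1.0 / (i22 - i12 * i12 / i11)


def theta_variance(i11, i12, nuisance_var):
    """Top-left entry of (2.1) or (2.2): I11^{-1} + I11^{-1} I12 V I21 I11^{-1}."""
    return 1.0 / i11 + (i12 / i11) ** 2 * nuisance_var


# Hypothetical information matrix entries (not taken from the paper).
i11, i12, i22 = 2.0, 0.5, 1.0

sigma_star = nuisance_mle_variance(i11, i12, i22)
sigma_22 = 1.5 * sigma_star  # hypothetical, less efficient alternative estimate

var_mle = theta_variance(i11, i12, sigma_star)    # variance under (2.2)
var_pseudo = theta_variance(i11, i12, sigma_22)   # variance under (2.1)
inefficiency = var_pseudo - var_mle

print(var_mle, var_pseudo, inefficiency)
```

The MLE variance computed this way agrees with the top-left entry of the inverse information matrix, $\mathcal{I}_{22}/(\mathcal{I}_{11}\mathcal{I}_{22} - \mathcal{I}_{12}^2)$, and the inefficiency vanishes when either $\mathcal{I}_{12} = 0$ or $\Sigma_{22} = \Sigma^*_{22}$.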


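3. Proof of (2.1). Consider the hypothetical estimate $\hat\theta_n(\pi_0)$ that could be
computed if $\pi_0$ were known. Under the regularity conditions given in Gong and
Samaniego (1981), the equation

$$
\dot\ell_n(\theta, \pi_0) = 0
$$

has a consistent root $\hat\theta_n(\pi_0)$. Expanding $\dot\ell_n(\hat\theta_n(\pi_0), \pi_0)$ about $\theta_0$ yields

$$
\sqrt{n}\,(\hat\theta_n(\pi_0) - \theta_0) = \mathcal{I}_{11}^{-1}\sqrt{n}\,\dot\ell_n(\theta_0, \pi_0) + o_p(1).
$$

Differentiating with respect to $\pi$ yields

$$
\partial\hat\theta_n(\pi)/\partial\pi = -\mathcal{I}_{11}^{-1}\mathcal{I}_{12} + o_p(1), \tag{3.1}
$$

where $\partial\hat\theta_n(\pi)/\partial\pi$ is evaluated at $\pi_0$.
The asymptotic distribution of $\hat\theta_n(\tilde\pi_n)$ then follows from the Taylor series
approximation:

$$
\sqrt{n}\,(\hat\theta_n(\tilde\pi_n) - \theta_0) = \sqrt{n}\,(\hat\theta_n(\pi_0) - \theta_0) + \sqrt{n}\,\big[\partial\hat\theta_n(\pi)/\partial\pi\big](\tilde\pi_n - \pi_0) + o_p(1).
$$

Using (3.1),

$$
\sqrt{n}\,(\hat\theta_n(\tilde\pi_n) - \theta_0) = \sqrt{n}\,(\hat\theta_n(\pi_0) - \theta_0) - \mathcal{I}_{11}^{-1}\mathcal{I}_{12}\sqrt{n}\,(\tilde\pi_n - \pi_0) + o_p(1). \tag{3.2}
$$

Pierce (1982) shows that $\sqrt{n}\,(\hat\theta_n(\pi_0) - \theta_0)$ and $\sqrt{n}\,(\tilde\pi_n - \pi_0)$ must be asymptotically
independent because $\hat\theta_n(\pi_0)$ is asymptotically efficient and $\sqrt{n}\,(\tilde\pi_n - \pi_0)$ has
asymptotic mean zero. (Hausman (1978) gives a similar application of asymptotic
efficiency to establish asymptotic independence.) The conclusion (2.1) follows
from this last result and (3.2). □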


Acknowledgment. The author would like to thank an associate editor for 
very helpful comments on an earlier version of this paper. 
CORRECTION


ADJUSTMENT BY MINIMUM DISCRIMINANT INFORMATION 


By SHELBY J. HABERMAN 
Annals of Statistics (1984) 12 971-988 


Anne Sheehy of the Department of Statistics, University of Washington,
noted a number of corrections in this article. The corrections are listed below by
page.

Page 975.  The first summation in (1.23) should be deleted.

Page 977.  In the display before (1.35), T should be preceded by a comma rather
           than by a right parenthesis.

Page 981.  In the proof of Theorem 1, P(U) should be PT⁻¹(U) and P(U,)
           should be PT⁻¹(U,).

Page 982.  On the second line, the numeral 7 should be omitted. In the
           statements of Theorem 3 and Lemma 1, N is an open neighborhood of 0
           rather than of θ. In the second to last line of the proof of Theorem 3,
           ¢,(9) should appear rather than /n(6).

Page 983.  Theorem 4 should refer to Lemma 1 rather than to Lemma 2. In the
           proof of Corollary 1, Theorem 4 rather than Theorem 2 applies.

Page 984.  In (4.2), the normal mean should be 0 rather than θ.

Page 985.  In (4.13), x_k is defined for k from 1 to b′ = dim H rather than for k
           from 1 to b. In (4.14), summations should be from 1 to b′ rather than
           from 1 to b. The summands of (4.14) should be squared. In (4.15),
           (x, $x) should be (x, $*x).

Page 986.  In (4.17), @p should be replaced by yp. In (4.19), X should be replaced
           by X,. In (4.20), F{ D) should be replaced twice by F),(D).
CORRECTION


SECOND ORDER APPROXIMATION TO THE RISK OF A 
SEQUENTIAL PROCEDURE 
By ADAM T. MARTINSEK 
Annals of Statistics (1983) 11 827-836 


In the Remark on page 834, the two asymptotic approximations for the regret
are incorrect. The first one should be

$$
2\beta + (\beta^2/4 - \beta)E\{(Z^2 - 1)^2\} + (\beta^2 + \beta)E^2(Z^3) + o(1)
$$

and the second should be

$$
2\beta + (\beta^2/4 - \beta)E\{(Z^2 - 1)^2\} + (\beta^2 + \beta)E^2(Z^3) + (2 + \beta) + o(1).
$$

As noted by Woodroofe (1985), the second order approximation given in Theorem
1 of my paper (corresponding to $\beta = 1$) coincides with his asymptotic lower
bound for M,(4), as defined in his paper, in the nonparametric case. The first
corrected expression above agrees with Woodroofe’s asymptotic lower bound for
all $\beta > 0$.
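To make the first corrected expression concrete, the sketch below evaluates it with the $o(1)$ term dropped, reading it as $2\beta + (\beta^2/4 - \beta)E\{(Z^2 - 1)^2\} + (\beta^2 + \beta)E^2(Z^3)$. The function and argument names are illustrative, and the moments are supplied by the caller rather than estimated. The standard normal case, where $E\{(Z^2 - 1)^2\} = 2$ and $E(Z^3) = 0$, is used as an example.

```python
def corrected_regret(beta, var_z2, third_moment):
    """First corrected second-order regret approximation, o(1) term dropped.

    var_z2       -- E{(Z^2 - 1)^2} for the standardized variable Z
    third_moment -- E(Z^3)
    """
    return (2 * beta
            + (beta ** 2 / 4 - beta) * var_z2
            + (beta ** 2 + beta) * third_moment ** 2)


# Standard normal Z: E{(Z^2 - 1)^2} = 2 and E(Z^3) = 0.
normal_beta_1 = corrected_regret(1.0, 2.0, 0.0)
normal_beta_2 = corrected_regret(2.0, 2.0, 0.0)
print(normal_beta_1, normal_beta_2)
```

For normal data the expression collapses to $\beta^2/2$, so the two calls above give 0.5 and 2.0.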
CORRECTION


ADAPTIVE ESTIMATORS FOR SIMULTANEOUS ESTIMATION 
OF POISSON MEANS 


By H. M. HUDSON
Annals of Statistics (1985) 13 246-261 


A number of errors occur in the derivation on page 251 of an approximation to 
the risk function. I am grateful to Ms. Diane Duffy and Professor Thomas 
Santner for drawing my attention to them. 

The following changes are required to correct the derivation. Delete lines 2 and
3 and replace by: “Then, since $S = H'(I - Q)H$,

$$
\begin{aligned}
S_i &= [H - (1/X_i)e_i]'(I - Q)[H - (1/X_i)e_i] \\
&= S - (2/X_i)(H_i - \hat H_i) + (1/X_i^2)(1 - q_{ii}) \\
&= S\{1 - (1/(X_i S))[2(H_i - \hat H_i) - (1/X_i)(1 - q_{ii})]\}.\text{”}
\end{aligned}
$$

Consequential changes occur in the exponents of terms involving $(1 - q_{ii})$ in lines
6 and 14. The expressions on lines 10 and 13 should be preceded by a negative
sign, and the second term of the expression on line 11 should begin $RX_i$, not
$R/X_i$. The argument from line 9 applies for $X_i > 1$. Thus the final sentence of
the paragraph should read: “The last term is negligible, by assumption. A similar
argument is required when $X_i = 1$, and this provides $[1 - (1/2R)](1 - q_{ii}) -
2(H_i - \hat H_i)^2/S$ as the two nonnegligible terms in the inequality above. Summing
nonnegligible terms, for $i = 1, 2, \ldots, p$, we obtain $(R^2 - N_1)/S$ as the estimated
risk reduction. Here $N_1$ is the number of cells containing the count 1.”
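The expansion of $S_i$ in the replacement text is a quadratic-form identity and can be checked numerically. The sketch below does so under two assumptions that are not stated in this note: that $Q$ is symmetric and that $\hat H = QH$. The matrix $Q$ and the vectors $H$ and $X$ are small made-up inputs used only for the check.

```python
def quad_form(vec, mat):
    """Return vec' * mat * vec for a list vector and a list-of-lists matrix."""
    n = len(vec)
    return sum(vec[r] * mat[r][c] * vec[c] for r in range(n) for c in range(n))


# Hypothetical inputs: Q symmetric (assumed), H and X arbitrary with X > 0.
Q = [[0.5, 0.2, 0.1],
     [0.2, 0.4, 0.0],
     [0.1, 0.0, 0.3]]
H = [1.0, 2.0, 0.5]
X = [2.0, 3.0, 1.0]
N = len(H)

I_MINUS_Q = [[(1.0 if r == c else 0.0) - Q[r][c] for c in range(N)] for r in range(N)]
S = quad_form(H, I_MINUS_Q)
H_HAT = [sum(Q[r][c] * H[c] for c in range(N)) for r in range(N)]  # assumed H-hat = Q H


def s_direct(i):
    """S_i computed from its definition, with H shifted by (1/X_i) e_i."""
    shifted = list(H)
    shifted[i] -= 1.0 / X[i]
    return quad_form(shifted, I_MINUS_Q)


def s_expanded(i):
    """S_i computed from the expanded form in the correction."""
    return S - (2.0 / X[i]) * (H[i] - H_HAT[i]) + (1.0 / X[i] ** 2) * (1.0 - Q[i][i])


for i in range(N):
    print(i, s_direct(i), s_expanded(i))
```

The two columns printed for each cell should coincide up to rounding error.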